1 Introduction

With the ever-growing accessibility of case law online, it has become (almost) impossible to analyse all case law manually. One country where this problem is becoming increasingly prevalent is the Netherlands. In recent years, the percentage of decisions published online (on rechtspraak.nl) has almost doubled, from 4.1% in 2017 to 7.8% in 2021.Footnote 1. This currently amounts to over 41,000 decisions per year, ranging from lower courts, such as the district courts, to higher courts, such as the courts of appeal and the Supreme Court. Even though the current case law database consists of ‘only’ around 600,000 decisions, it is already challenging for legal practitioners and researchers to find the ones relevant to their case or research. The ambition of the Dutch Council for the Judiciary is to implement a system in which 75% of all decisions are published.Footnote 2. If no measures are taken to improve the data’s searchability, this will lead to even more problems with retrieving the relevant decisions. The increase in the availability of legal data calls for ways to analyse this data automatically, since doing so manually is too time-consuming.

A way to automatically analyse large amounts of textual data is by using machine learning (ML). Over the past few decades, ML techniques have been used for various tasks in the field of artificial intelligence and law. For instance, in legal outcome forecasting (as defined by Medvedeva et al. 2022), the outcome of a court case is predicted from the facts of the case (formulated before the outcome was known) with the help of classification algorithms and natural language processing (NLP) techniques. Research by e.g. Medvedeva et al. (2021b) shows that the text of legal proceedings holds valuable information for this task.

Our present study investigates whether it is possible to forecast if a decision of a Dutch court will be cited in a future Dutch court decision. We use Dutch court data as a case study, as this data is available online. Fowler and Jeon (2008) have shown that case authority, which is the extent to which a decision is deemed important for settling other legal disputes, and citations are related. Consequently, by forecasting incoming citations, one may forecast the authority of a decision before this authority is even acknowledged in other case law (i.e. through actual citations).

Citing case law can have different functions. In countries with a common law system, such as the United Kingdom, the law is ‘judge-made’, meaning that the law is not codified but is created and developed through court decisions. In common law countries, judges decide along the lines of earlier decisions made in similar cases (i.e. precedents). As such, citations have a different function in these countries than in countries with a civil legal culture, where most of the law is codified. The Netherlands follows the civil law tradition: the law is created by a legislator, and unlike common law countries, the Netherlands does not adopt the doctrine of stare decisis. Consequently, previous cases are taken into account in the Netherlands (especially those from higher courts), but judges are not obliged to follow legal precedents. As such, the authority of cases and the function of citations differ between common law and civil law countries. In fact, since civil law countries (such as the Netherlands) are not bound by historical precedents, and thus not obliged to refer to previous decisions, one might even argue that the relationship between citations and authority is even stronger in civil law countries than in common law countries (Zweigert and Kötz 1998).

Van Opijnen (2016) states that a system for tagging the importance of decisions is essential for the accessibility of legal big data. An example of such an implementation can be found in HUDOC, the online database of the European Court of Human Rights (ECtHR). In this database, it is possible to filter case law by importance level.Footnote 3. These importance levels have been assigned manually to each case that has been uploaded, which makes it easy to implement an importance filter. However, in an existing database without previously recorded importance levels, such as the Dutch rechtspraak.nl, this is not possible. Therefore, our study aims to contribute to the first step in implementing an authority ranking system for Dutch case law.

Contrary to the ECtHR implementation, we do not distinguish between different importance levels. Rather, we differentiate between a clearly defined class encompassing all non-authoritative decisions (receiving zero incoming citations) and a sliding-scale class encompassing all other decisions that may or may not be authoritative (receiving any number of citations greater than zero). We perform a binary classification task in which we forecast whether or not a decision is cited by other case law at all, thereby predicting whether the decision will be non-authoritative (meaning ‘uncited’) or not. This prediction could be used to help label decisions by filtering out the (likely) non-authoritative cases. This implementation will therefore not identify the most important decisions, but it helps filter out decisions that are certainly not important, which is especially useful when navigating the large amounts of available data. Therefore, this system can help legal practitioners substantially reduce the time spent on preparing their research or case.

Besides building a model that forecasts whether or not a court decision will be cited, we also aim to gain insight into the most informative features for determining citability. We investigate whether certain words, phrases, or characteristics increase the likelihood of a decision getting cited or remaining uncited. In doing so, we hope to contribute practically to implementing a ‘(non-)authority filter’ on rechtspraak.nl.

The following section discusses prior research related to prediction and forecasting tasks and network analysis using legal data. Section 3 describes the data and features used in our experiments. Next, in Sect. 4 we explain the methods and setup of the experiments that we have conducted. In Sect. 5 we report the results of these experiments. Finally, we discuss these results in Sects. 6 and 7 and draw conclusions.

2 Background

Traditional research in the field of law usually consists of doctrinal analysis. Yet, in recent years, empirical methods have been used as well (Vols 2021a, b). In our work, we aim to forecast whether or not a case was cited to determine the (non-)importance of a decision by combining the knowledge gained from using legal citation analysis and machine learning techniques applied to legal data. Machine learning techniques have been used for a variety of tasks in the legal field. Some examples of these tasks are extracting and summarising the most important parts of cases (e.g., Moens et al. 1997; Pandya 2019), extracting semantic legal metadata from laws (e.g., Spinosa et al. 2009; Sleimi et al. 2018, 2021), detecting unfair clauses in terms and conditions (e.g., Lippi et al. 2019), identification of the subject of case law (e.g., Medvedeva et al. 2021a) and, as mentioned before, legal decision prediction. The latter has been a relatively common practice in the field of AI and law. It has been performed on legal data from, e.g., Chinese courts (Zhong et al. 2018), the UK Supreme Court (Strickson and De La Iglesia 2020), the French Supreme Court (Şulea et al. 2017a, b), the Supreme Court of the Philippines (Virtucio et al. 2018), the Supreme Court of the United States (Katz et al. 2017), and, most often, the European Court of Human Rights (e.g., Chalkidis et al. 2019; Medvedeva et al. 2020; Kaur and Bozic 2019; O’Sullivan and Beel 2019). An extensive overview of artificial intelligence techniques used in legal analytics can be found in Ashley (2017), and an overview of recent advances in the field is provided by Whalen (2020). A discussion of previous work about predicting court outcomes can be found in Medvedeva et al. (2022). This work indicates that legal big data suits numerous machine learning (ML) and natural language processing (NLP) techniques.

Another empirical research method that has been used in the field of law is citation analysis. Networks can be found in any research area, including the nerve cells in the human brain, relations in society, web pages on the internet, and citations of scientific literature (Barabási and Bonabeau 2003). Researchers in numerous fields have found that many networks are not distributed randomly, but are instead dominated by a small number of nodes that account for the majority of the connections. These important nodes, also called ‘hubs’, sometimes have a seemingly unlimited number of connections with no characteristic scale. Barabási and Bonabeau (2003) state that it is important to determine if one is dealing with a scale-free network to properly understand its behaviour. In legal citation networks, we also find characteristics of a scale-free network. A legal citation network is formed by the connections between legal documents (the nodes) through citations (the edges). While a relatively small number of highly influential ‘landmark decisions’ attract a substantial number of connections, the majority of decisions do not receive any citations at all. This is supported by findings of Leitão et al. (2019), who investigated the citations over time of over 17,000 admitted cases from the European Court of Human Rights up until 2016. Both Barabási and Bonabeau (2003) and Leitão et al. (2019) state that scholars or practitioners are more likely to cite well-established or well-known documents when they cite previous sources. In the legal field, this reinforces the influence and connectivity of those landmark cases, which is also known as the rich-get-richer effect, or ‘preferential attachment’. As a result, highly cited cases become hubs within the legal citation network, shaping its structure and dynamics.

An extensive history of citation analysis in law can be found in Whalen (2016), in which different applications of network analysis on legal data are described. For instance, there has been research into the social networks of criminals, but there has also been work that views statutes, regulatory codes, or case law from a network analysis viewpoint. Leitão et al. (2019) perform an analysis of the evolution of precedents over time and attempt to explain the importance of decisions by means of the Bass model. They find that the major part of how decisions are cited can be explained by a combination of the rich-get-richer mechanism and external factors, in which the former tends to play a larger role. According to Fowler and Jeon (2008), it is possible to rank decisions of the Supreme Court of the United States on authority using citation network data. While citations can happen for different reasons, they unquestionably provide evidence for the use of a previous decision, thus making the number of incoming citations a useful quantitative measure of the usage of a decision within courts. They describe an authority score, which is based on the number of times a decision gets cited, and the quality of these citing decisions. They argue that this authority score is able to identify decisions that legal experts label as ‘landmark decisions’. Some benefits of their score are that it takes much less effort to calculate than to have an expert form an opinion and that there is no chance of a subjective bias, which a human expert might exhibit. The assigned scores even show which decisions might become important in the future. Kuppevelt and Dijck (2017) present a similar tool specifically developed for Dutch case law.

Sadl and Tarissan (2020) demonstrate the potential of using legal network analysis to study the Court of Justice of the European Union (CJEU). They are able to identify landmark decisions and crucial legal developments by using measures of centrality to reflect case importance. They detect the fluctuating importance of decisions by using complementary centrality measures, and argue that the relative in-degree score of a decision can provide a comprehensive view of the evolution of case importance. They address critiques of network analysis and conclude that it may never replace doctrinal analysis, but that it can provide an objective, transparent basis for legal research. The work of Sartor et al. (2023) provides an automated extraction pipeline for CJEU case law. They present a valuable tool to create and analyse networks, and they argue that automating the process will support traditional legal research too. Derlén and Lindholm (2017) go one step beyond finding the most authoritative nodes in a network, and use several metrics on a CJEU network to determine the current precedential power of a decision, i.e. to detect whether it is still ‘good law’. They conclude that the metrics they use are not always consistent with the expert opinion of lawyers and that researchers should be mindful of the methods they use. As investigated by Derlén and Lindholm (2017), decisions can become redundant over time, but can also be ‘awakened’ after a while and suddenly start gathering citations years after their publication. Such decisions are called ‘Sleeping Beauties’ (Ke et al. 2015). Hernandez Serrano et al. (2020) present an algorithm that aims to identify these decisions in CJEU case law. Their methodology is consistent with traditional network metrics, and they find that the most highly influential decisions in a network tend to go unnoticed for a longer time than other decisions (almost 11 months longer).

Winkels and de Ruyter (2011) performed an analysis of case law of the Dutch Supreme Court. Their research shows that decisions cited most seem to ‘fill gaps in legislation’. This means that the decision made by the court is not covered by a piece of legislation yet, and the decision is cited often until the ‘gap’ is fixed. They also find that the most cited decisions are often about procedural law. Still, this observation may be influenced by the fact that they only analysed data from the Supreme Court. They compare their research to Fowler and Jeon (2008) and say that even though the Dutch Supreme Court cites fewer decisions than the US Supreme Court, the number of citations seems to be a good indicator of authority for Dutch case law as well. From the aforementioned studies we deduce that decisions which are not cited are less authoritative. By identifying these uncited decisions, it should be possible to filter out decisions that are less authoritative and, therefore, less interesting for legal practitioners.

Though the use of citation networks has been present in legal research, work on predicting the number of citations using machine learning has yet to be published. However, Mones et al. (2021) use a Random Forest classifier to predict links between decisions, which they find to be highly predictable. They argue that an empirical understanding of the application of legislation is essential as it not only supports equality in treatment, but also improves effectiveness and consistency. They find that the most informative factors to a prediction change over time: the content of a decision plays a smaller role over time, whereas features of the network itself grow more important to the prediction. Comparable to Sadl and Tarissan (2020), Mones et al. (2021) argue that algorithmically identifying relevant decisions could never fully replace the lawyer’s insights, but it can definitely provide useful advantages.

There is some work on the statistical ranking of Dutch decisions. Van Opijnen (2012) attempts to measure legal authority by performing an extensive citation network analysis using half a million Dutch decisions. He defines and measures legal authority in various ways, namely the number of incoming citations from other case law, the number of publications in legal journals, the number of annotations published with the decisions, and his own metric, the ‘Marc In-Degree’ (calculated as \(1 + \log_2(C)\), in which C is the number of incoming citations). The author concludes that exogenous variables (e.g., incoming citations) are relevant for determining case authority and that the endogenous variables he examined (e.g., the type of court or the length of the decision) are by themselves not sufficient for obtaining reliable results. He then builds upon these findings by creating the MARC (‘Model for Automated Ranking of Case Law’) score (Van Opijnen 2013). This model is implemented in the internal database of the Dutch judiciary to calculate an authority score for each decision. The model consists of two parts: the first part analyses the decisions that have not been cited yet (the ‘publication period’), and the second part analyses the decisions that have been cited (the ‘citation period’). The score is then continuously updated based on the changing incoming citations. The first part of the statistical model is based only on several selected (primarily) endogenous variables, which he concluded to be less trustworthy than exogenous variables in his previous work (Van Opijnen 2012). However, Van Opijnen (2013) concludes that even though the endogenous predictors do not add much to a model that has access to the exogenous predictors, the endogenous predictors have enough predictive value on their own. We also evaluate several of these variables in our approach to predicting whether or not a case is cited.

In the present study, we are expanding upon prior research by Van Opijnen (2012) and Van Opijnen (2013) by assessing the (non-)authority of Dutch case law. We do this by predicting whether or not rechtspraak.nl decisions are cited. For this, we solely use endogenous features from the metadata and the texts of decisions (extracted through NLP techniques), all of which are available from the moment the decisions are published. In doing so, we also aim to determine if any endogenous variables, not described by Van Opijnen, are valuable to include in determining whether or not a case is cited. Our approach is, therefore, a first step towards determining the case’s authority, as cases which are not cited are also not authoritative.

3 Data

3.1 Data collection

The data used for this study consist of Dutch case law from rechtspraak.nl. The content and metadata of all published decisions can be downloaded in XML format via Open Data van de Rechtspraak, the Open Data of the Judiciary (ODR).Footnote 4 The downloaded ODR dataset contains about 3,090,000 files from 1911 up to 2022, sorted per month. However, the contents of a large number of ODR files are not available to the public. Some are only available to the judiciary in a particular archive, and some publications have been revoked. These files were filtered out and not used in our experiments.

All published files containing decisions have a relatively consistent structure that can be found online in the technical documentation.Footnote 5. The structure of the text of the decision itself varies slightly per court of law. Still, it usually contains an introduction, process flow, considerations, and a decision. There is, however, much variation in the aesthetic formatting, as there are likely many different editors working on these files, each using their own style conventions.

As the incoming and outgoing citations are not adequately registered in ODR, we used another governmental dataset for this, dubbed the Linked Data Overheid, ‘Linked Data Government’ (LIDO). This dataset contains all of the links between a large number of governmental web pages, which also include citations to case law. This dataset is updated monthly as well.Footnote 6 The citations in this dataset were extracted from the text by a sophisticated algorithm, the LinkeXtractor (Van Opijnen 2018). This algorithm recognises various citation formats but may make mistakes in rare cases. For instance, a 1905 Supreme Court decisionFootnote 7 cites, according to the LinkeXtractor, the 2001 ECtHR decision Van den Hoogen v. the Netherlands,Footnote 8 which is impossible. The extractor deduced this citation from the phrase ‘van den Hoogen Raad’ (which means ‘by the Supreme Court’ in old Dutch and matches part of the name of the 2001 case). We filtered out any citations to future case law to correct these erroneous citations. We have also filtered out citations due to ‘formal relations’, i.e., a decision by a lower or higher court in the same case. We are only interested in citations that are made because of the relevance of the content of a decision, as only these citations indicate the authority of a decision. However, we include formal relations as a feature for predicting whether or not the decisions get cited, which we elaborate on later in this section.
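As a minimal sketch of the citation clean-up just described (assuming a tabular representation of the LIDO links; the column names source_date, target_date, and relation_type are hypothetical), the two filters could look as follows:

```python
# Illustrative sketch only: dropping links that point to future case law
# or that are 'formal relations'; column names are hypothetical.
import pandas as pd

def clean_citations(citations: pd.DataFrame) -> pd.DataFrame:
    # A decision cannot cite a decision ruled after it, so such links are
    # treated as extraction errors and removed.
    valid = citations[citations["target_date"] <= citations["source_date"]]
    # Formal relations (earlier/later instances of the same case) do not
    # indicate authority, so they are not counted as incoming citations.
    return valid[valid["relation_type"] != "formal_relation"]
```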

3.2 Data selection

The Dutch Council for the Judiciary started publishing the data online in December 1999. We do not have access to outgoing citations from decisions that are not available online, so we chose to exclude decisions ruled before December 1999, which were only published online after their ruling date.

We focus on three types of courts: the district courts (DC), the Council of State (CS), and the Supreme Court (SC). The Supreme Court is the highest court in private, criminal and tax cases, while the Council of State is the highest court for administrative law. As of 2022, there are eleven district courts, which we combined, as they generally treat the same types of cases at first instance and there are not enough decisions published for each court separately. Courts that were renamed or abolished in the past have also been included in this dataset. For example, there used to be one district court for the eastern part of the Netherlands, but it was later split into two district courts for the provinces of Overijssel and Gelderland. The three types of courts (DC, CS and SC) were distinguished from each other, as this allows us to compare citations of decisions at first instance and at final appeal (SC/CS versus DC) and to compare between areas of law (SC versus CS). Our datasets contain decisions up to the 31st of August, 2022, which leaves us with 29,007 SC decisions, 59,356 CS decisions, and 153,735 DC decisions.

The number of citations is determined in relation to a specific time span during which the decision was cited. In Figs. 1, 2, and 3, the grey part of each bar indicates decisions that have been cited within one year, two years, five years, ten years, and the entire period available, respectively, whereas the part of the bar with diagonal lines indicates decisions that have not been cited within these time frames. The increase in the number of cases cited after ten years is relatively limited (2.7%). However, as a ten-year time span would result in a very small training set (as only cases could be selected that were published more than ten years ago), we opted for the five-year time span instead. The majority of cases which get cited at some point after publication also get cited in the first five years (on average across the three datasets: 84.7%). Because we forecast citations over a period of five years, we exclude all cases not published at least five years ago (i.e. those published after September 1st, 2017).
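Under these choices, labelling reduces to checking whether a decision received at least one incoming citation within five years of its ruling date. A minimal sketch (with hypothetical field names, not the actual labelling code) could look like this:

```python
# Label construction sketch: 1 = cited within five years, 0 = uncited.
# The cut-off of 1 September 2017 follows from a dataset ending 31 August 2022.
from datetime import date, timedelta

FIVE_YEARS = timedelta(days=5 * 365)

def cited_within_five_years(ruling_date: date, citation_dates: list[date]) -> int:
    return int(any(d <= ruling_date + FIVE_YEARS for d in citation_dates))

print(cited_within_five_years(date(2010, 3, 1), [date(2012, 6, 1)]))  # 1
print(cited_within_five_years(date(2010, 3, 1), [date(2020, 6, 1)]))  # 0
```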

Fig. 1 Distribution of cited vs. non-cited decisions for the Supreme Court over time

Fig. 2 Distribution of cited vs. non-cited decisions for the Council of State over time

Fig. 3 Distribution of cited vs. non-cited decisions for the district courts over time

We train a model by providing it with the text of the decisions and whether a decision was cited or not (i.e. the ‘labels’). The labels cannot be derived from the texts of the decisions. The model then learns what characteristics (i.e. the ‘features’) are indicative of each label. A held-out development set is used to determine the best algorithm and settings of the algorithm. After the training and development phase, we test the model on data which was excluded from this phase. The selected model thus has to apply the knowledge it has gathered during its training phase to forecast the labels of these new data. We train our model on decisions that are older than the decisions that we test on, which mirrors a real-life situation.

For our experiments, we use the oldest 60% of the data as training data. From the remaining 40%, we use the oldest half (i.e. 20%) as the development data and the most recent half (i.e. 20%) as the test data. For the final experiment, we train the model on both the training and development data, which means we train on 80% of all the data and test on the remaining 20%. This is a common split used in machine learning, which has empirically been shown to be the best division of train and test data (Gholamy et al. 2018). The respective sizes of the datasets can be found in Table 1. The column called ‘Label’ refers to the value we are forecasting: a 0 label means that a decision received zero incoming citations, and a 1 label means that a decision received one or more incoming citations.

As Table 1 shows, the data is (sometimes heavily) skewed towards not being cited. To counteract this, we ran some initial experiments with weights assigned to each class in the classifier. However, the performance of the Council of State and the district courts models was very poor, with the model only predicting the label that was present more often (the uncited decisions). We therefore balanced all the training data by undersampling the majority class (for all types of cases). This means that we randomly removed decisions from the majority class (‘uncited’) until it had the same size as the minority class (‘cited’). Table 2 shows the resulting counts per dataset. We did not balance the development and test data, to still simulate a real-life scenario. Note that when adding the development set to the training set for the testing phase, the majority class was again undersampled to ensure that our training data remained balanced.
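A minimal sketch of this undersampling step (assuming a pandas DataFrame with a binary 'label' column; the implementation details are assumptions, not the paper's actual code):

```python
# Toy sketch of balancing by undersampling the majority ('uncited') class.
import pandas as pd

def undersample(train: pd.DataFrame, seed: int = 42) -> pd.DataFrame:
    cited = train[train["label"] == 1]
    uncited = train[train["label"] == 0]
    uncited_down = uncited.sample(n=len(cited), random_state=seed)  # randomly drop uncited rows
    return pd.concat([cited, uncited_down]).sample(frac=1, random_state=seed)  # shuffle
```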

Table 1 Sizes of datasets per court prior to balancing of the training data
Table 2 Sizes of datasets per court after balancing of the training data

Table 3 shows the sources of the incoming citations for each type of court in our balanced dataset. This table reveals some insights into what our labels consist of, and into the different citation patterns per court. The Supreme Court can be cited by all levels of courts from all areas of law, including itself. A district court can, in theory, be cited by any court from any law area. In practice, however, district courts tend to be cited mostly by other district courts. The Council of State only decides in the administrative law area, in which there is no court of appeal. We see these facts reflected in Table 3. Over 96% of the incoming citations of the Council of State originate from the Council itself and from the district courts. The Supreme Court is cited by all lower levels of courts (e.g. courts of appeal and district courts). The district courts are mostly cited by themselves.

Table 3 Distribution of the incoming citations per dataset

3.3 Features

We have extracted a number of features from the available metadata of our datasets. We found 29 different variables, but we only used a selection, since not all of them contribute valuable information that could help in forecasting the citations. Specifically, we did not use fixed values (e.g., ‘language’, which was always ‘Dutch’), unique values (e.g., ‘identifier’, which is different for every decision and used by courts internally), elements that were absent for the majority of decisions (e.g., ‘temporal’, indicating if the decision of a case is dependent on a specific time frame), or elements containing information that was not available at the moment of publication of the decisions (e.g., ‘hasVersion’, which contains journals where a decision was later published) and that therefore should not be used to forecast citations of decisions. After eliminating these metadata, we were left with ‘procedure’, ‘law area’, and ‘outgoing citations’, which we transformed as described in the rest of this section.

The outgoing citations contained both citations to case law and legislation, so we split these into citations to domestic law, domestic peak level court decisions, domestic non-peak court level decisions, formal relations that were published earlier in the same case, formal relations with the General-Prosecutor’s Office, and EU case law and legislation. The identifier of the latter did not allow us to easily differentiate between legislation and case law, and thus had to remain combined. As the forms of all of these citations vary substantially, we chose to focus on the number of citations. More information on these types of citations can be found in Table 4.

Furthermore, we crafted a number of features based on the work of Van Opijnen (2012). The endogenous features that he used to predict case law importance are the following: type of court, number of judges (which, among other things, reflects the importance or complexity of the case being tried), news item (whether a decision is published on the homepage of rechtspraak.nl), length of the decision, references to European law and to domestic law and case law, and the Marc Out-degree (a network analysis algorithm developed by Van Opijnen (2012), based on outgoing citations to Dutch case law). We could not include the Marc Out-degree and the news item features, as this information was not available in our datasets, but all other features were included to a certain extent. The number of judges was included in the previously mentioned ‘procedure’ feature, as this information is sometimes recorded in this line of metadata. We did not use this as a separate feature, as that would require processing each decision to find the number of judges. Instead, we only used this information when it was available in the metadata. EU-connotations and cited legislation were present in the previously mentioned citation features. Additionally, we used the text of the summary and the text of the entire decision in the form of n-grams (i.e. sequences of one or more consecutive words) as feature sets. The complete list of feature categories that were used in our experiments is shown in Table 4.

Table 4 A description of all of the individual features used further along in this paper

3.4 Feature representation

Not all features in Table 4 are machine-readable. To present the data in a format suitable for the machine learning algorithm, we needed to convert them to a numerical representation. The procedure and law area were ‘one-hot encoded’. This means that all values of a categorical feature (such as law area, with possible options ‘administrative law’, ‘criminal law’, and ‘private law’) are transformed into their own column (‘law_area_administrative_law’, ‘law_area_criminal_law’, and ‘law_area_private_law’) with a value of either 0 or 1. Originally, the law_area values were divided into some very specific areas, such as ‘private law; law of obligations’. We only preserved the broader law area named before the semicolon, as the latter part was often too specific, and occurred too infrequently, to be a representative feature. For the Council of State, this feature was irrelevant, as all of the decisions belong to the law area of administrative law. In the Supreme Court dataset, 35.3% belongs to administrative law (limited to tax law), 34.6% to criminal law, and 30.2% to private law. In the district courts data, 34.3% is administrative law, 26.0% is criminal law, and 39.6% consists of private law decisions.Footnote 9
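As an illustration of this encoding (a toy sketch with invented values, not the actual preprocessing code), one-hot encoding the law area could look as follows:

```python
# Toy example of one-hot encoding a categorical feature.
import pandas as pd

law_area = pd.Series(["administrative law", "criminal law", "private law", "criminal law"])
one_hot = pd.get_dummies(law_area, prefix="law_area")
print(one_hot)  # one indicator column per law area, with 0/1 (or True/False) values
```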

In the procedure feature, some values were grouped together for the same reason: values that were much less frequent than other, more prevalent values were grouped into an ‘other procedure’ category. All groupings are shown in Table 5.

Table 5 Description of types of procedures that were grouped together. If the column ‘Contains’ is empty, the value shown in the column ‘Grouped Values’ was used by itself, as it made up a very large percentage of the total

For the summary of contents and the complete text of the decision, we used the TfidfVectorizer.Footnote 10 This method converts texts into series of numbers (i.e. vectors) and assigns higher (tf-idf) scores to words that are frequent in a document and also characteristic of that document. Specifically, tf-idf is the product of the term frequency, which is the number of times a word appears in a document, and the inverse document frequency, which is the logarithm of the number of documents divided by the number of documents containing the term. This means that so-called stop words (the most common words in a language), which are present in most, if not all, documents, have lower scores than rarer words that are more informative about a document. A more extensive explanation of tf-idf can be found in Medvedeva et al. (2020). It should be noted that the use of tf-idf is a relatively simple approach, and there have been significant improvements in NLP that have expanded the range of possible techniques to represent features. Examples include word embeddings, neural networks, and transfer learning. However, in order to establish our baseline models, we have opted for this more basic approach.
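In formula form, following the verbal definition above (note that scikit-learn's implementation additionally applies smoothing and normalisation by default):

\[
\text{tf-idf}(t, d) = \text{tf}(t, d) \times \log\frac{N}{\text{df}(t)},
\]

where \(\text{tf}(t, d)\) is the number of times term \(t\) appears in document \(d\), \(N\) is the total number of documents, and \(\text{df}(t)\) is the number of documents containing \(t\).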

TfidfVectorizer has a number of parameters that may influence performance, such as removing capital letters or using n-grams (sequences of multiple words or characters) instead of only focusing on single words or characters. We chose to use word n-grams instead of character n-grams, as we want the resulting features to be human-readable and interpretable. We included (1,4) n-grams, which means that sequences of 1, 2, 3, or 4 words were included as features (with their value being the tf-idf score). We did not remove stop words, as the tf-idf weighting already compensates for this.
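A minimal sketch of this configuration with scikit-learn (only the word n-gram range and the absence of stop-word removal are taken from the description above; all other parameters are left at their defaults, which is an assumption, and the example texts are invented):

```python
# Toy illustration of the word-(1,4)-n-gram tf-idf setup.
from sklearn.feature_extraction.text import TfidfVectorizer

decision_texts = [
    "de rechtbank verklaart het beroep ongegrond",
    "de hoge raad vernietigt de bestreden uitspraak",
]
vectorizer = TfidfVectorizer(analyzer="word", ngram_range=(1, 4))  # no stop-word removal
X_text = vectorizer.fit_transform(decision_texts)
print(X_text.shape)  # (number of documents, number of word n-gram features)
```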

4 Method

4.1 Algorithms

A Support Vector Machine (SVM) was used for our experiments (i.e. forecasting whether or not a decision is cited). This algorithm allows us to investigate the weights assigned to the features, and thus we can determine which features made the largest contribution to the prediction. The SVM (Vapnik 1999) is a popular algorithm that performs well on legal classification tasks. For a more elaborate explanation of SVMs, the interested reader is referred to Wu et al. (2008) or Medvedeva et al. (2020). Specifically, in this study, we used scikit-learn’s LinearSVCFootnote 11 algorithm.
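A minimal, self-contained sketch of fitting such a linear SVM and inspecting its feature weights (toy data; the actual feature matrices and any tuning are not shown here and the default hyperparameters are an assumption):

```python
# Toy example of scikit-learn's LinearSVC: fit on a small feature matrix and
# inspect the per-feature weights (coef_), which can later be used to find the
# most informative features.
from sklearn.svm import LinearSVC

X_train = [[0.2, 3], [0.9, 15], [0.1, 2], [0.8, 20]]  # e.g. a tf-idf value and a citation count
y_train = [0, 1, 0, 1]                                # 0 = uncited, 1 = cited

clf = LinearSVC()
clf.fit(X_train, y_train)
print(clf.coef_)  # one weight per feature; sign and magnitude indicate its contribution
```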

4.2 Evaluation

We compare the experiments performed with the SVM algorithm using the Matthews correlation coefficient (MCC). We additionally report accuracy scores and the macro \(F_1\)-score.

The accuracy is the percentage of correctly classified decisions. However, the accuracy does not take class imbalance into account. To account for this, we use the macro \(F_1\)-score, which is the unweighted average of the harmonic mean of precision and recall for both the cited and the uncited decisions. Precision is the fraction of correctly classified decisions among all decisions assigned a given label (i.e. how many of the decisions that were classified as ‘cited’ are correct). Recall is the fraction of correctly classified decisions among all decisions with that label (i.e. how many of the decisions that belong to the ‘cited’ class have been found by the algorithm). Finally, MCC is a robust metric that only yields a high score if the model performs well for all types of predictions (true positives, false negatives, true negatives, and false positives). MCC scores range from −1 to 1, and MCC is generally considered a good metric to evaluate model performance, especially for imbalanced datasets, as it takes class prevalence into account (Chicco and Jurman 2020). As the Matthews correlation coefficient is a specific application of Pearson’s correlation coefficient to binary variables, we interpret the results in a similar way: an absolute MCC of 0.01 to 0.19 is interpreted as no, or a negligible, relationship, 0.20 to 0.29 represents a weak relationship, 0.30 to 0.39 a moderate relationship, 0.40 to 0.69 a strong relationship, and any score above 0.70 is considered to indicate a very strong relationship. To gain insights into the performance for each label separately for the final models, we also have a closer look at precision, recall, and confusion matrices.
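Concretely, in terms of the confusion matrix counts (true positives TP, true negatives TN, false positives FP, and false negatives FN), the MCC is computed as:

\[
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}.
\]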

All metrics mentioned above, except for MCC, range between 0 and 1. MCC can vary between −1 and 1, but only positive values are meaningful in this case. For all metrics, a higher score indicates a better performance.

4.3 Baseline

A baseline serves as a starting point: a simple model to which we compare the performance of more sophisticated models. As the labels in the development and test sets were not balanced (see previous section), a simple (majority class) baseline model could always predict ‘no citations’. This would result in a model that performs reasonably well, being correct in at least 57% of the decisions in the case of the SC experiment, 62% for the CS, and 90% in the case of the DC experiment. However, we also wanted to assess whether our final model improves over a very simple machine learning model. Consequently, our second baseline model used word unigrams (i.e. features consisting of single words) from the text of the decisions, converted to a bag-of-words (BOW) representation. This means that the words were vectorised using CountVectorizer (which simply tracks the frequency of each individual word). A linear SVM was then used for classification, with all parameters set to their default values. We report the scores for both the majority class and the bag-of-words baselines on the test data in Sect. 5.
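A minimal sketch of this bag-of-words baseline (toy texts and labels; only the unigram CountVectorizer plus default linear SVM setup is taken from the description above):

```python
# Toy bag-of-words baseline: unigram counts fed into a linear SVM with default parameters.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [
    "beroep ongegrond verklaard",
    "hoge raad formuleert nieuwe maatstaf",
    "verzoek niet ontvankelijk",
    "herziening van eerdere uitspraak",
]
labels = [0, 1, 0, 1]  # 0 = uncited, 1 = cited (invented labels)

baseline = make_pipeline(CountVectorizer(), LinearSVC())
baseline.fit(texts, labels)
print(baseline.predict(["beroep ongegrond"]))  # forecast for a new (toy) decision text
```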

4.4 Feature selection

This study aims to identify how useful certain features are for forecasting whether or not a decision will be cited after its publication. Therefore, we perform experiments to identify which type of feature holds the most information. For this purpose, we combine all features from the metadata, and we combine all features that are textual (the summary and the decision). We also look at the different types of metadata (categorical and numerical) separately. Then we combine all of these features together. These initial experiments are all evaluated on the development data and compared to the BOW baseline model and the majority class baseline described above. The best-performing combination is then used in our final model, which is evaluated using the separate held-out test set. We then look into the most informative features of the best-performing model to gain insights into the reasons why a decision might be cited or remain uncited. Finally, since a slightly lower-performing but more explainable model might be preferred over a better-performing but opaque model, we compare the performance of highly explainable models (i.e. those based only on handcrafted metadata features) to the performance of the best model.
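Combining the textual and metadata feature sets amounts to concatenating their (sparse) feature matrices column-wise, for example (a toy sketch, not the actual pipeline):

```python
# Toy example of combining tf-idf text features with numeric metadata features.
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["beroep ongegrond", "cassatie gegrond"]
X_text = TfidfVectorizer().fit_transform(texts)        # textual features
X_meta = np.array([[0, 1, 3],                          # e.g. one-hot procedure + citation counts
                   [1, 0, 12]])
X_all = hstack([X_text, csr_matrix(X_meta)])           # combined feature matrix
print(X_all.shape)
```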

5 Results

5.1 Determining the best configuration

First, we discuss the experiments in which we compare the different types of features: metadata (numerical and categorical) versus textual.

The results of the Supreme Court experiments on development data can be found in Table 6. The features from all metadata together perform almost as well as the textual features, with strong positive MCC scores of 0.53 and 0.58 respectively. As the combination of all metadata and textual features performed the best (strong positive MCC of 0.58 and a 0.01 increase in \(F_1\)-score over the model using textual features), this combination was also used in the final SVM model.

Table 6 Scores for feature combinations for the Supreme Court on the development set

The overall performance of the Council of State models is worse than that of the Supreme Court models and can be found in Table 7. Again, the combination of both metadata and textual features performed best (weak positive MCC of 0.27), and thus this combination was used in the final SVM model. Using only the metadata was again not much worse than using the textual features (MCCs of 0.24 and 0.26, respectively).

Table 7 Scores for feature combinations for the Council of State on the development set

The results for the district courts can be found in Table 8. The DC scores were much lower than those of the previous courts we have evaluated, with performance peaking at a (negligible) positive MCC of 0.15. The feature set based on the text and the metadata of the decision is again the best-performing combination. Based on these results, the final SVM model used a combination of all of the metadata and textual features for the district courts as well.

Table 8 Scores for feature combinations for the district courts on the development set

For all the courts, the best-performing model included a combination of all features (both textual and metadata). All models showed an improvement over both baseline models in terms of MCC.

5.2 Best model performance

In Table 9, we list the scores of the two baseline models: the majority class baseline and the bag-of-words baseline (a linear SVM with unigram counts). Both were evaluated on the held-out test set. Underneath them are the scores of the models using a combination of both the textual and metadata features in a LinearSVC. Compared to the Supreme Court (strong positive MCC of 0.60), the other two courts perform much worse, with the Council of State achieving a weak positive MCC of 0.27 and the district courts a negligible MCC of 0.17. Nevertheless, the baselines of both courts were outperformed in terms of their MCC scores.

Table 9 Comparison between the majority class baseline, bag-of-words baseline, and best SVM model, using metadata and textual features, per court

For the best-performing models, we looked into the precision and recall scores per label (Table 10) and the confusion matrices (Tables 11, 12, and 13). For all three courts, the precision for the uncited decisions is higher than the precision for the cited decisions. The Supreme Court model has similar \(F_1\)-scores for both labels, only 0.03 apart, and performs a bit better at forecasting the 0 label, with a particularly high precision when forecasting the uncited cases. The Council of State model shows comparable performance in terms of \(F_1\)-score for both labels. However, while it is very precise in labelling uncited cases, it fails to detect many of them (i.e. low recall). The opposite is true for the cited cases: it is not very precise in identifying these, but it identifies almost all of them. Most of the incorrect Council of State forecasts are false positives: a decision being forecast as cited when in reality it is uncited. The district courts model shows the greatest difference between the \(F_1\)-scores of the labels, with the ‘uncited’ label having a score of 0.70 and the ‘cited’ label merely a score of 0.25. As the confusion matrix shows, it is very precise in identifying uncited cases, but not able to identify cited cases precisely.

Table 10 Precision, recall, and \(F_1\)-scores of the best SVM models per court
Table 11 Confusion Matrix of the best Supreme Court SVM model
Table 12 Confusion Matrix of the best Council of State SVM model
Table 13 Confusion Matrix of the best district court SVM model

5.3 Analysing the most informative features

Because we are interested in the most informative features for a model that works well for both cited and uncited cases, we investigate the features of the Supreme Court models in more depth. When looking at the most informative features for this model, we only find n-grams that originate from the decisions and summaries. The most informative n-grams from the summaries can be found in Fig. 4. Among the most informative words for uncited decisions, we find ‘niet ontvankelijk’ (inadmissible) and ‘ongegrond’ (unfounded). These words are sensible, given that they indicate decisions that are not ground-breaking. We also see a number of references to ‘80a’ and ‘81 ro’, which refer to procedural legislation that the Supreme Court can use to rule on a case without much, or even any, reasoning. Among the most informative summary features for the cited decisions, we find ‘herziening’ (revision) and ‘maatstaf’ (criterion). These features make sense in light of the work of Fowler and Jeon (2008), who stated that reversed decisions tend to be more important, as well as with case law that fills ‘gaps’ in legislation (e.g., by introducing or elaborating on a legal criterion). There is also a term containing ‘hr nj 1930’, which refers to a specific Dutch case law review journal, in which interesting decisions are published with annotations from legal scholars.
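The most informative n-grams per label can be read directly off the weights of a fitted linear SVM: the most negative coefficients point to the 'uncited' label and the most positive ones to the 'cited' label. A self-contained toy sketch of one way to do this (invented texts and labels, not our actual data or code):

```python
# Toy sketch: ranking n-gram features by their SVM coefficients.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

texts = [
    "beroep ongegrond",
    "verzoek niet ontvankelijk",
    "maatstaf voor herziening",
    "nieuwe maatstaf geformuleerd",
]
labels = [0, 0, 1, 1]  # 0 = uncited, 1 = cited (toy labels)

vec = TfidfVectorizer(ngram_range=(1, 2))
clf = LinearSVC().fit(vec.fit_transform(texts), labels)

weights = clf.coef_[0]
order = np.argsort(weights)
names = vec.get_feature_names_out()
print("uncited:", names[order[:5]])   # features with the most negative weights
print("cited:  ", names[order[-5:]])  # features with the most positive weights
```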

Fig. 4 The fifteen most informative features from the summaries of the Supreme Court decisions per label. Negative scores are informative for the uncited decisions, positive scores are informative for the cited decisions

In Fig. 5, we find the most informative features that originate from the full text of the decision. Among the most informative features for uncited decisions, we find a very specific medical feature, ‘myalgische encefalomyelitis’ (chronic fatigue syndrome), and some numbers that we assume to be case numbers that the judiciary uses internally. We observe similar numbers among the most informative features for cited decisions, but even legal scholars were not able to decipher what they refer to.

For the goal of our research, we prefer false positives (i.e. cases predicted to be cited that are in fact not cited) over false negatives (i.e. cases predicted not to be cited that are in fact cited). We would rather receive a recommendation for a decision that turns out to be irrelevant than have an important decision filtered out from our results. To find a possible explanation for the false negatives, we manually assessed a randomly selected sample of 10% of the false negative predictions of the best-performing Supreme Court experiment (43 out of 433 documents). We found that all examined decisions, except for one, were very short decisions. Most of them were decisions that were dismissed, deemed unfounded, or declared inadmissible. 35 out of 43 decisions were ruled without any substantive reasoning, which the Supreme Court is allowed to do according to the procedural articles mentioned before (‘artikel 80a R.O.’ and ‘artikel 81 R.O.’).

Fig. 5 The fifteen most informative features from the texts of the Supreme Court decisions. Negative scores are informative for the uncited decisions, positive scores are informative for the cited decisions

5.4 Performance of highly explainable models

N-grams are not completely explainable, as we have seen: not even legal experts could make sense of some of the most informative n-grams of the best model. This is why we compared the performance of the best-performing model of the previous section to a completely explainable system that uses only metadata features. The results can be found in Tables 14 and 15. In line with the development data results, the performance of these models on the test set is similar to that of the best models. The Supreme Court and district courts models perform slightly worse, whereas the Council of State model performs marginally better.

The precision for the uncited decisions is in all instances lower for the metadata-only model, with the largest difference being found for the Supreme Court (0.86 to 0.78). The recall, however, improved for the 0-label in all three cases when using only metadata, with the largest difference being found for the district courts model (0.55 to 0.68). The opposite happened for the cited decisions: the precision was better for the models using metadata only (largest difference for the Supreme Court, which went up from 0.73 to 0.79), while the recall went down in all cases (largest differences for the Council of State, which went down from 0.89 to 0.74, and the district courts, which went down from 0.73 to 0.53).

The metadata-only models forecast the uncited label more often, which is reflected in higher overall accuracy and \(F_1\)-scores for the Council of State and district courts (see Table 14). This makes sense, as those datasets contain many more uncited than cited cases.

Table 14 Comparison between an SVM using metadata features and the best-performing model using textual and metadata features
Table 15 Precision, recall, and \(F_1\)-scores of the SVM models using only metadata per court

We are interested in which features contribute most to the performance of the model including only metadata. Figures 6, 7, and 8 (shown below) display the metadata features, ranked according to their coefficients. A negative contribution indicates that a feature is useful for forecasting the 0-label, whereas a positive contribution is helpful for forecasting the 1-label.

In Fig. 6, we see that the length of Supreme Court decisions contributes the most of all metadata features to determining whether a decision is cited. The other most indicative feature is ‘other procedure’ (which, in the case of the Supreme Court, distinguishes between cassations and non-cassations), which is more closely related to uncited decisions. The influence of the other features is clearly much weaker. For the Council of State, we mostly see substantial influences on the negative side in Fig. 7: all procedure types except for ‘other procedure’ are indicators of uncited decisions. The length of the decision and the number of citations to domestic law are again among the top contributors for determining cited cases. In Fig. 8, we find that the (negative) coefficients of the district courts model are smaller than those of the Supreme Court and the Council of State. The largest contribution on the positive side is made by the law area ‘administrative law’, whereas ‘criminal law’ is a contributor on the negative side of the graph. ‘Other procedure’ is again a contributor on the negative side, whereas ‘first instance - multiple judges’ is more likely to be associated with a cited case.

Fig. 6 Contribution of metadata features for the Supreme Court

Fig. 7 Contribution of metadata features for the Council of State

Fig. 8 Contribution of metadata features for the district courts

6 Discussion

In this section, we discuss our main findings and their possible explanations: 1) all SVM models are reasonably well suited to identify whether or not a decision will be cited; 2) when our models predict that a decision will not be cited, this prediction is quite precise, and the models can therefore be used to filter decisions; 3) the Supreme Court models outperform the models of the other courts; and 4) the simpler metadata models perform almost as well as the more complicated models that use the entire judgment, and sometimes even better.

The first finding we discuss is the performance of the SVM models. Our research confirmed that, at least for the Supreme Court, a linear SVM model is able to predict whether or not a decision will be cited. While the general results were much worse for the Council of State and the district courts’ decisions, for all three courts the precision of predicting uncited decisions was quite high (ranging between 0.84 and 0.95 for our best models). While the recall of uncited decisions is not very high for the Council of State and the district courts, the decisions these models do filter out are likely uncited. This result makes sense, as the uncited label is a clear-cut class in which all decisions receive zero citations, whereas the cited label is a scale on which decisions may receive anywhere between one and thousands of citations. Consequently, our second finding is that using our systems for filtering out non-authoritative decisions, i.e. before attempting to identify the most authoritative decisions, is certainly feasible. While this asymmetry between the classes might seem strange, it does make sense, as decisions which are not cited are not authoritative, whereas decisions that are cited may be authoritative (if they are cited often) or non-authoritative (if they are cited only infrequently). Consequently, future work might look at predicting the citation frequency.

Our third finding is that the Supreme Court models outperform the models of the other courts by a large margin. The best Supreme Court model has an MCC of 0.60, which can be considered a strong positive score. However, the MCC score of the best Council of State model is only 0.26 (weak positive), whereas it is negligible (0.17) for the district courts.

A possible explanation for the difference between the Supreme Court and district courts may be the difference in the variety of decisions. District courts are known as ‘fact-finding courts’, focusing on the facts and the evidence of a case, whereas the Supreme Court is a court of cassation. This means that the Supreme Court only assesses some aspects of a case, and it does not, for example, substantively reassess the facts and evidence of a case. Under Dutch law, the Supreme Court only reviews whether a lower court (i.e. district court or court of appeal) applied and interpreted the law correctly and applied the procedural rules properly. Consequently, not all decisions are fit for cassation. The Supreme Court will declare an appeal inadmissible or dismiss the appeal with a short and standard ruling in several situations. Specifically, it will do this if the cassation appeal does not focus on the interpretation of the law, if the Supreme Court already ruled on the interpretation of the contested law in a previous decision, if the court of appeal has sufficiently explained its judgment, or if new facts are presented in the cassation appeal.

District courts do not have such strict prerequisites and rule on a wider variety of decisions than the Supreme Court. The wide variety of district court decisions could result in a greater discrepancy between the test data and the training data of the district courts. For example, a very specific theme in the training data might never be mentioned again in later decisions in the test data, or there could be new topics present in the test data that the model has not seen before in the training data.

While this large variety of decisions explains the low general performance of the district court models, it does not explain the low performance for the Council of State, as the latter’s rulings are limited to administrative law only.

However, a reason for this lower performance may be related to the organisation of the Dutch judiciary. Before a case ends up at the Supreme Court, a district court and a court of appeal decision are made in nearly all private and criminal law decisions. However, administrative law has only two levels of courts: district courts and the Council of State. There are no courts of appeal in administrative law. Consequently, without a court of appeal, there is no ‘filter’ between the court of first instance (district courts) and the court of last instance (the Council of State). In other words, if a party disagrees with the assessment of the facts of the case, the evidence that has been brought forward, or the motivation of the district court, the party could go straight to the Council of State. The lack of a filter, in combination with the fact-finding role of the Council of State, might diminish the authority of their decisions. This could explain why it is much harder to forecast whether or not the Council of State is cited: there is not as much meaning attached to these decisions, which may be reflected in the lack of meaningful words in the text of the decision.

Finally, the difference in performance between the Supreme Court model and the other models could be traced back to the distribution of the data itself. We balanced the training data of all of the courts, but the imbalance in the Supreme Court data was the smallest originally. This means that the SC models were trained on the most true-to-life distribution and the smallest amount of data was lost in the process of balancing. We removed 3.6% of the SC training data, whereas the CS and DC lost 59.5% and 86.3% of their training data, respectively. For future research, we recommend investigating the effects of the data splits further, either by looking into the effects of balancing per court in-depth, or by investigating different train-test-development splits.

For the goal of our research, we prefer false positives (i.e. cases predicted to be cited, but that are not) over false negatives (i.e. cases predicted not to be cited, but that are in fact cited). We would rather receive a recommendation for a decision that turns out to be irrelevant than have an important decision filtered out of our results. While the Supreme Court model generally performed best, it also showed the highest percentage of false negatives in its confusion matrix (see Table 11). For SC, CS, and DC respectively, the percentages of false negatives out of the total forecasts of the best-performing SVM model were 7.5%, 4.1%, and 2.7%. As mentioned before, both the Council of State and the district courts are fact-finding courts. The range of subjects they judge is much wider than that of the more abstract decisions of the Supreme Court. A large part of the uncited decisions of the CS and DC models was left out of the training data to prevent the models from simply predicting the majority class, but this also means that the models only learnt from a small portion of the uncited decisions.
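
To show how these percentages relate to the confusion matrix, the following sketch computes the share of false negatives among all forecasts with scikit-learn. The 0/1 encoding of uncited/cited and the toy input are assumptions for illustration only.

```python
from sklearn.metrics import confusion_matrix

def false_negative_share(y_true, y_pred):
    """Fraction of all forecasts that are false negatives
    (decisions predicted as uncited (0) that were in fact cited (1))."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return fn / (tn + fp + fn + tp)

# Toy example: one in four forecasts is a false negative, so this prints 0.25.
print(false_negative_share([1, 1, 0, 0] * 10, [0, 1, 0, 0] * 10))
```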

In Sect. 5 we have seen that there is an overlap between the most informative features that the model has learnt for uncited decisions and the characteristics of the model’s false negative predictions. Legal scholars who further assessed these false negative decisions qualified their texts as non-authoritative. Courts seem to cite these decisions in similar circumstances and employ these citations to substantiate the omission of their own motivation. Consequently, our model appears to be capable of identifying cited decisions that lack any information for future decisions (i.e. decisions that lack authority). We believe it would be reasonable for such cited decisions to be excluded by a non-authority filter. One false negative could not be explained by the legal scholars; we accept that our model is not flawless.

The fourth and final finding was that the performance of the models using only metadata features was reasonably close to that of the models using all (including textual) features. It is interesting that an explainable model with far fewer features performs (almost) as well as a less explainable model. Even though we attempted to create an explainable SVM model through the use of word n-grams and no preprocessing of the words, legal experts still could not make sense of the most informative textual features. Consequently, a simpler model may be preferred despite its lower performance if it is (better) explainable. We should note that we have only used metadata features that could be extracted from the metadata section of the XML files. Further research should be conducted into the extraction of specific metadata features from the text of the decision, either by manual annotation or by a reliable extraction algorithm. Examples include the involvement of children, drugs, or legal counsel, or the gender of the judge or the parties involved.
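
For readers who wish to reproduce this comparison, the sketch below outlines how the two model variants could be set up in scikit-learn. The specific choices (a linear SVM, count-based word uni- and bigrams without lowercasing, standard-scaled numeric metadata columns) are assumptions made for illustration, not our exact configuration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Text-based variant: raw word uni- and bigrams, no lowercasing or stemming,
# so the learnt feature weights remain (in principle) human-readable.
text_model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), lowercase=False),
    LinearSVC(),
)

# Metadata-only variant: a small, fixed set of numeric features
# (e.g. decision length, number of outgoing citations to peak courts),
# scaled before fitting the same linear SVM.
metadata_model = make_pipeline(
    StandardScaler(),
    LinearSVC(),
)

# Hypothetical usage:
# text_model.fit(train_texts, train_labels)
# metadata_model.fit(train_metadata, train_labels)
```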

Currently, we have only looked at the text of a decision as a whole, not at separate parts of the text. By separating the text into different parts, such as the facts and the reasoning of the court, the most informative parts of a decision could be investigated further. It is possible that the most informative features will then make more sense to legal scholars.

As the scope of our experiments was mostly focused on establishing a baseline for a new task, we have not used state-of-the-art NLP techniques (e.g. deep learning or large language models) in our research. This means that the performance of the metadata models is compared to text-based models that are themselves not the most sophisticated available. In future research, an interesting comparison could be drawn between our explainable metadata-only models and less explainable models based on deep learning. Nevertheless, for the task of identifying non-authoritative cases, our relatively simple machine learning models exhibit impressive performance.

In Sect. 5, we found that, generally speaking, the length of decisions and the outgoing citations to peak courts are large contributors to the performance of the models.

The importance of the citations to peak court decisions could be explained by the difference in role and function between the peak courts and the other courts. Citations to the Supreme Court are among these peak court citations. As mentioned before, the Supreme Court is a court of cassation under Dutch law. This has implications not only for the aspects of a case that the court reviews, but also for the function of the appeal. In contrast to appeals to other courts, the function of cassation is not only to protect the interests of the plaintiff, but also (and perhaps even more so) to ensure legal certainty and uniformity and to contribute to the development of the law. If new phenomena occur for which the law offers no ready-made solution (yet), or if new insights arise regarding what is fair and just, the Supreme Court has to provide guidance on how to deal with such developments (Verheugt 2020). This different role and function of the Supreme Court compared to the other courts could explain why citations to peak-level decisions are a useful feature for forecasting the authority of a Supreme Court decision and of a decision of the Council of State. When we look at Tables 6 and 7, we see that these references to peak courts are usually self-references, as cross-references rarely occur. The Supreme Court referring to its own decisions might be indicative of a decision’s contribution to the development of the law. In turn, such a decision might be interesting for future referencing as well, so that a whole network of citations is created. Research on the different functions of citations (e.g. filling a ‘gap’ in legislation, reversing a previous decision, or summarising similar situations) is needed to discover whether the Supreme Court itself is the reason that outgoing citations to peak-level courts perform reasonably well as a predictor, and why this feature does not work as well for the district courts. Other features might also be worth investigating in future work, such as references to specific laws (Van Opijnen 2012), to treaties, or to international legislation in particular.

The function of the Supreme Court could also explain why the length of the decision is a large contributor to the SC model. An important function of supreme courts is to fill ‘gaps’ in legislation (Fowler and Jeon 2008). As such, Supreme Court decisions may provide new interpretations of existing law or offer solutions to problems for which no law yet exists. Evidently, such decisions need more explanation than decisions in which the Supreme Court simply dismisses an appeal. In contrast to the other courts, the Supreme Court can, for example, dismiss an appeal on procedural grounds without explaining its reasoning (see Articles 80a and 81 of the Dutch Judicial Organisation Act). As such, the function of the Supreme Court might also explain why the length of the decision is such a good indicator.

Using citations as a proxy for authority overlooks the fact that judgments can become redundant over time. This redundancy could be accounted for by a separate filter with access to the actual citations within the Dutch judiciary, which might incorporate a reversed version of the Sleeping Beauty coefficient (Hernandez Serrano et al. 2020). Since our present research focuses on predicting non-authority rather than authority, this is less of an issue, but this factor should definitely be accounted for in the case of an importance filter.
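
For reference, a commonly used formulation of the Sleeping Beauty coefficient for a decision that receives \(c_t\) citations \(t\) years after publication, with the citation count peaking in year \(t_m\), is (we assume this is the definition underlying the coefficient used by Hernandez Serrano et al. 2020):

\[
B = \sum_{t=0}^{t_m} \frac{\dfrac{c_{t_m} - c_0}{t_m}\, t + c_0 - c_t}{\max(1, c_t)}
\]

Large values of \(B\) indicate decisions that lay dormant for a long time before being cited; a reversed use of the coefficient would instead highlight decisions whose citations dry up after an early peak, i.e. decisions that have become redundant.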

Finally, we should note that a particular selection bias is in place. Until 2012, the publication of court decisions was based on qualitative criteria such as media attention, importance for public life, and consequences for the application of regulations. As of 2012, decisions of all peak courts (e.g. the Supreme Court and the Council of State) should always be published unless the decision is “unfounded or inadmissible and/or dismissed with a standard reasoning”.Footnote 12. Decisions of the district courts should be published if a case has received media attention or if the decision is of importance for further rulings. These criteria do not restrict courts: they may develop additional criteria or decide to publish every decision they make. Still, these publication guidelines imply a certain selection bias on rechtspraak.nl towards more authoritative decisions. Once the judiciary starts publishing the vast majority of its case law, rechtspraak.nl will likely contain relatively fewer district court decisions that are cited, as the ‘unimportant’ decisions that currently remain unpublished will then be published as well. This reduction in the cited-to-uncited ratio could complicate the creation of a model that will also perform well in the future.

7 Conclusion

In this study, we have found that the text and metadata of decisions hold information that can be used to forecast whether or not a decision will be cited.

While our current models are not accurate enough to provide reliable predictions for both labels across all types of courts, our systems can be used as a first filter. Although the predictions that a decision will be cited are not very trustworthy for the Council of State and the district courts, for all three courts the predictions indicating which decisions will not be cited are fairly reliable, with the district court model showing a precision of 0.95 when predicting uncited decisions. This means that when our systems indicate that a decision is not authoritative (i.e. that it will not be cited), this is likely correct, and priority can be given to other decisions to save time. Our study also serves as a first baseline for an experiment that has not been carried out before. In particular, the experiments regarding the Supreme Court of the Netherlands have been very promising, yielding accuracy and \(F_1\)-scores of 0.80 and a strong positive MCC of 0.60.
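
For clarity, the Matthews correlation coefficient we report here is the standard binary-classification formulation, expressed in terms of the confusion-matrix cells:

\[
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
\]

It ranges from \(-1\) (complete disagreement) through 0 (chance-level forecasting) to \(+1\) (perfect forecasting), so a value of 0.60 indicates a substantial correlation between the forecasts and the actual citation outcomes.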

Finally, our results showed that a simpler, more explainable model using only a dozen features did not perform much worse than a model using millions of textual features. The trade-off between explainability and performance is worth investigating in future work.