In this Section we describe the experiments conducted in this study. In Experiment 1 we investigate the possibility of using words and phrases in the text of the cases to predict judicial decisions. In Experiment 2 we use the approaches from the first experiment to estimate how well future cases can be predicted. In Experiment 3 we test whether we can make predictions based solely on objective (although limited) information. Specifically, we evaluate how well we are able to predict the Court’s judgements using only the surnames of the judges involved.
Experiment 1: textual analysis
Set-up
The data we provided to the machine learning program does not include the entire text of the court decision. Specifically, we have removed the decisions and the dissenting/concurring opinions from the texts of the judgements. We have also removed the Law part of the judgement, as it includes arguments and discussions of the judges that partly reveal the final decision. See, for instance, the following statement from the Case of Palau-Martinez v. France (16 December 2003):
50. The Court has found a violation of Articles 8 and 14, taken together, on account of discrimination suffered by the applicant in the context of interference with the right to respect for their family life.
From this sentence it is clear that in this case the Court ruled that there was a violation of Articles 8 and 14. Consequently, letting our program predict the decision based on this information would be unfair, as the text already reveals the decision (‘found a violation’). Moreover, the discussions that the Law part contains are not available to the parties before the trial and therefore predicting the judgement on the basis of this information is not very useful. We have also removed the information at the beginning of the case description which contains the names of the judges. We will, however, use this data in Experiment 3. The data we used can be grouped into five parts: Procedure, Circumstances, Relevant Law, the latter two together (Facts) and all three together (Procedure + Facts).
An important characteristic of the aforementioned parts is that they are available years before the decision is made. The Facts part of the case becomes available after the case is ruled to be admissible. One might argue that, because the cases available on HUDOC were written after the trial, information that was found irrelevant and was dismissed during the trial might not appear in the text. However, the Facts of the case are available and remain unchanged for several years before the final judgement is made (i.e. after the application is found admissible on the basis of the formal criteria). As part of the procedure, the Court ‘communicates’ the application to the government concerned and lists the facts stated by the applicant, in order for the government to be able to respond to the accusation. The Procedure part lists information regarding what happened before the case was presented to the Court and is thus also available before the decision is made. The rest of the text of the case only becomes available when the final judgement is made by the Court, and thus should not be used to predict the decisions in advance.
Until now we have ignored one important detail, namely how the text of a case is represented for the machine learning program. For this we need to define features (i.e. observable characteristics) of each case. In terms of the cats-and-dogs example, features of each picture would be the length of the tail as a proportion of the total body length, being furry or not, the number of legs, etc. The machine learning program will then determine which features are the most important for classification. For the cats-and-dogs example, the relative tail length and furriness will turn out to be important features in distinguishing between the two categories, whereas having four legs will not be important. An essential question then becomes how to identify useful features (and their values for each case). While it is possible to use manually created features, such as particular types of issues that were raised in the case, we may also use automatically selected features, such as all separate words, or short consecutive sequences of words. The machine learning program will then determine which of these words or word sequences are most characteristic of either a violation or a non-violation. A contiguous sequence of one or more words in a text is formally called a word n-gram. Single words are called unigrams, sequences of two words are bigrams, and sequences of three consecutive words are called trigrams.
For example, consider the following sentence:
By a decision of 4 March 2003 the Chamber declared this application admissible.
If we split this sentence into bigrams (i.e. 2 consecutive words) the extracted features consist of:
By a, a decision, decision of, of 4, 4 March, March 2003, 2003 the, the Chamber, Chamber declared, declared this, this application, application admissible, admissible .
Note that punctuation (e.g., the full stop at the end of the sentence) is also interpreted as being a word, which is why the last bigram above is ‘admissible .’. For trigrams, the features consist of:
By a decision, a decision of, decision of 4, of 4 March, 4 March 2003, March 2003 the, 2003 the Chamber, the Chamber declared, Chamber declared this, declared this application, this application admissible, application admissible .
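As an illustration, the following Python snippet (a minimal sketch, not the code used in our experiments) extracts such word n-grams, treating the sentence-final punctuation as a separate token:

```python
# Minimal n-gram extraction sketch; punctuation marks are kept as separate tokens.
import re

def extract_ngrams(text, n):
    # Split into word tokens and standalone punctuation tokens.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "By a decision of 4 March 2003 the Chamber declared this application admissible."
print(extract_ngrams(sentence, 2))  # bigrams, as listed above
print(extract_ngrams(sentence, 3))  # trigrams, as listed above
```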
While we now have shown which features can be automatically extracted, we need to decide what values are associated with these features for each separate case. A very simple approach would be taking all n-grams from all cases and using a binary feature value: 1 if the n-gram is present in the case description and 0 if it is not. But of course, we then throw away useful information, such as the frequency with which an n-gram occurs in a document. While using the frequency as a feature value is certainly an improvement (e.g., ‘By a’: 100, ‘4 March’: 1, ‘never in’: 0), some words are simply more common and therefore used more frequently than other words, despite these words not being characteristic for the document at all. For example, the unigram ‘the’ will occur much more frequently than the word ‘application’. In order to correct for this, a general approach is to normalise the absolute n-gram frequency by taking into account the number of documents (i.e. cases) in which each word occurs. The underlying idea is that characteristic words of a certain case will only occur in a few cases, whereas common, uncharacteristic words will occur in many cases. This normalised measure is called term frequency-inverse document frequency (or tf-idf). We use the formula defined by the scikit-learn Python package (Pedregosa et al. 2011): \({\text {tfidf}}(d, t) = {\text {tf}}(t) * {\text {idf}}(d, t)\), where \({\text {idf}}(d, t) = \log (n / {\text {df}}(d, t)) + 1\), n is the total number of documents, and \({\text {df}}(d, t)\) is the document frequency, i.e. the number of documents d that contain term t. In our case the terms are n-grams. So, let’s say we have a document of 1000 words containing the word torture 3 times; the term frequency (i.e. tf) for torture is \((3 / 1000) = 0.003\). Now let’s say we have 10,000 documents (cases) and the word torture appears in 10 documents. Then the inverse document frequency (i.e. idf) is \(\log (10,000 / 10) + 1 = 4\) (using a base-10 logarithm). The resulting tf-idf score is \(0.003 * 4 = 0.012\). This is the score (i.e. weight) assigned to the word torture. Note that this score is higher than the plain tf score, as it reflects that the word does not occur often in other documents.
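The worked example above can be reproduced with a few lines of Python. This is a sketch only; it uses a base-10 logarithm to match the numbers in the text, whereas scikit-learn’s TfidfVectorizer uses the natural logarithm by default, so its raw scores will differ:

```python
import math

tf = 3 / 1000                       # 'torture' occurs 3 times in a 1000-word document
idf = math.log10(10_000 / 10) + 1   # 'torture' appears in 10 of 10,000 documents
print(tf, idf, tf * idf)            # 0.003 4.0 0.012
```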
In order to identify which sets of features we should include (e.g., only unigrams, only bigrams, only trigrams, a combination of these, or even longer n-grams) we evaluate all possible combinations. It is important to realise that longer n-grams are less likely to occur (e.g., it is unlikely that one full sentence occurs in multiple case descriptions) and therefore are less useful to include. For this reason we limit the maximum word sequence (i.e. n-gram length) to 4.
However, there are also other choices to make (i.e. parameters to set), such as whether all words should be converted to lowercase or whether capitalisation is important. For these parameters we take a similar approach and evaluate all possible combinations. All parameters we have evaluated are listed in Table 3.Footnote 13 Because we had to evaluate all possible combinations, there were a total of 4320 different possibilities to evaluate. As indicated above, cross-validation is a useful technique to assess (only on the basis of the training data) which parameters are best. To limit the computation time, we only used threefold cross-validation for each article. The program therefore trained 12,960 models. Given that we trained separate models for 5 parts of the case descriptions (Facts, Circumstances, etc.), the total number of models was 64,800 for each article and 583,200 models for all 9 articles of the ECHR. Of course, we did not run all these models manually, but rather created a computer program to conduct this so-called grid-search automatically. The best combination of parameters for each article was used to evaluate the final performance (on the test set). Table 4 shows the best settings for each article.Footnote 14
During this type of search, we identify which parameter setting performs best by testing each combination of parameters a total of 3 times (using random splits to determine the data used for training and testing) and selecting the parameter setting which achieves the highest average performance. We use this approach to make sure that we did not just get ‘lucky’, but that overall the model performs well. Of course, it is still possible that the model performs worse (or better) on the test data.
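As an illustration of this procedure, the following sketch shows how such a grid-search over tf-idf and SVM parameters can be set up with scikit-learn. The parameter grid is illustrative rather than the full set of parameters from Table 3, and train_texts and train_labels are hypothetical variables holding the training data:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", LinearSVC()),
])

param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2), (1, 3), (1, 4)],  # up to 4-grams
    "tfidf__lowercase": [True, False],
    "tfidf__stop_words": [None, "english"],
    "svm__C": [0.1, 1, 10],
}

# Threefold cross-validation on the training data only, scored by accuracy.
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
# search.fit(train_texts, train_labels)          # hypothetical training data
# print(search.best_params_, search.best_score_)
```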
Table 3 List of evaluated parameter values
Table 4 Selected parameters used for the best model
For most articles unigrams achieved the highest results, but for some articles longer n-grams performed better. As we had expected, the Facts section of the case was the most informative and was selected for 8 out of 9 articles. For many articles the Procedure section was also informative. This is not surprising, as the Procedure contains important information on the alleged violations. See, for instance, a fragment from the Procedure part of the Case of Abubakarova and Midalishova v. Russia (4 April 2017):
3. The applicants alleged that on 30 September 2002 their husbands had been killed by military servicemen in Chechnya and that the authorities had failed to investigate the matter effectively.
Results
After investigating which combinations of parameters worked best, we used these parameter settings together with tenfold cross-validation to ensure that the model performed well in general and was not overly sensitive to the specific set of cases on which it was trained. When performing tenfold rather than threefold cross-validation, there is more data available for training in each fold (i.e. 90% rather than 66.7%). The results can be found under ‘cross-val’ in Table 5. Note that as we used a balanced dataset (both for the cross-validation and the test set), the number of ‘violation’ cases is equal to the number of ‘non-violation’ cases. Consequently, if we were simply to guess the outcome at random, we would be correct in about 50% of the cases. Percentages substantially higher than 50% indicate that the model is able to use the (simplified) textual information present in the case to improve the prediction of its outcome. Table 5 shows the results for the cross-validation in the first row.
Table 5 Cross-validation (tenfold) and test results for Experiment 1
In order to evaluate whether the model predicts both classes similarly well, we use precision, recall and F-score to estimate the performance. Precision is the percentage of cases assigned a certain label (i.e., ‘violation’ or ‘non-violation’) for which that label is correct. Recall is the percentage of cases actually having a certain label which are identified correctly. The F-score can be described as the harmonic mean of precision and recall.Footnote 15 Table 6 shows these measures per class for each model.
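Formally, for a given class, \({\text {precision}} = TP / (TP + FP)\), \({\text {recall}} = TP / (TP + FN)\) and \({\text {F-score}} = 2 \cdot {\text {precision}} \cdot {\text {recall}} / ({\text {precision}} + {\text {recall}})\), where TP, FP and FN denote the number of true positives, false positives and false negatives for that class, respectively.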
Table 6 Precision, recall and f-score per class per article achieved during tenfold cross-validation
As we can see from the table, ‘violation’ and ‘non-violation’ are predicted very similarly for each article.
During the training phase, the machine learning program assigns different weights to the bits of information it is given (i.e. the n-grams) and creates a hyperplane which uses support vectors to maximise the distance between the two classes. After training the model, we may inspect the weights in order to see what information had the most impact on the model’s decision to predict a certain ruling. The feature values determine the coordinates of the data points, such as the stars and circles we have seen in Fig. 1, while the weights determine the orientation of the separating hyperplane: the more positive the weight of an n-gram, the more strongly it pushes a case towards the ‘violation’ class, and the more negative the weight, the more strongly it pushes a case towards the ‘non-violation’ class. These weights can therefore be used to determine how important each n-gram was for the separation. The n-grams that were most important may hopefully yield some insight into what might influence the Court’s decision making. The idea behind the approach is that we will be able to determine certain keywords and phrases that are indicative of specific violations. For instance, if a case mentions a minority group or children, then based on previous cases with the same keywords, the machine learning algorithm will be able to determine the verdict better.
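A hedged sketch of this inspection, assuming a fitted scikit-learn pipeline like the one sketched earlier (with a TfidfVectorizer step named "tfidf" and a LinearSVC step named "svm"), could look as follows:

```python
import numpy as np

def top_ngrams(fitted_pipeline, k=20):
    """Return the k n-grams with the most negative and the most positive SVM weights."""
    vectorizer = fitted_pipeline.named_steps["tfidf"]
    svm = fitted_pipeline.named_steps["svm"]
    features = np.array(vectorizer.get_feature_names_out())
    weights = svm.coef_.ravel()        # one weight per n-gram
    order = np.argsort(weights)
    # The most negative weights pull towards one class (here: 'non-violation'),
    # the most positive weights towards the other ('violation').
    return features[order[:k]], features[order[-k:]]

# non_violation_ngrams, violation_ngrams = top_ngrams(search.best_estimator_)
```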
In Fig. 3 we visualize the phrases (i.e. 3-grams and 4-grams) that ranked the highest in identifying a case as a ‘violation’ (blue, on the right) or a ‘non-violation’ (red, on the left) for Article 2. If we look at the figure, we can notice, for instance, that the Chechen Republic is an important feature in relation to ‘violation’ cases, while Bosnia and Herzegovina has a higher weight on the ‘non-violation’ side.
We also observe many relatively ‘meaningless’ n-grams, such as ‘in case no’ or ‘the court of’. This can be explained by the simplicity of our model and the lack of filtering of unnecessary information. As we have mentioned, the only processing of the text that we do is lowercasing the words for some articles and filtering out some of the more common words, such as the, a, these, on, etc., for some of the articles.
After tuning and evaluating the results using cross-validation, we tested the system using the ‘violation’ cases (for Article 14: non-violation cases) that had not been used in determining the best system parameters. These are the surplus cases that were left over when balancing the dataset. The results can be found in Table 5 (second results line). For several articles (6, 8, 10, 11, 13) the performance on the test set was worse, but for a few it was quite a lot better (e.g., Articles 2, 5, 14). Discrepancies in the results may be explained by the fact that sometimes the model learns to predict non-violation cases better than violation cases. By testing the system on cases that only contain violations, performance may then seem to be worse. The opposite happens when the model learns to predict violations better: in that case, the results on this violation-only test set become higher. Note that the test set for Article 14 contains non-violations only, and an increase in the performance here indicates that the model has probably learned to predict non-violations better. Nevertheless, the test results overall seem to be relatively similar to the cross-validation results, suggesting that the models perform well despite having only used very simple textual features.
Discussion
The results, with an average performance of 0.75, show substantial variability across articles. It is likely that these differences are to a large extent caused by differences in the amount of training data: the less training data there is, the less the model is able to learn from it.
To analyse how well the model performs, it is useful to investigate the confusion matrix. This matrix shows in what way the cases were classified correctly and incorrectly. For example, Table 7 shows the confusion matrix for Article 6. There were 916 cases in the training set for Article 6, half of which (458 cases) had a violation verdict, and the other half a non-violation verdict. In the table we see that of the 458 cases with a non-violation, 397 were identified correctly and 61 were identified as cases with a violation. Additionally, 346 cases with a violation were classified correctly and 112 cases were identified as a non-violation. Given that the number of non-violation and violation cases is equal, it is clear from this matrix that the system for Article 6 is better at predicting non-violation cases than violation cases, as we could also see in Table 6.
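Such a confusion matrix can be computed from out-of-fold predictions, for example as in the following sketch (assuming scikit-learn, and hypothetical pipeline, texts and labels variables):

```python
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

# predictions = cross_val_predict(pipeline, texts, labels, cv=10)
# cm = confusion_matrix(labels, predictions, labels=["non-violation", "violation"])
# For Article 6 this would yield the rows [[397, 61], [112, 346]] described above.
```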
Table 7 Confusion matrix for tenfold cross-validation for Article 6
Cases themselves may also influence the results. If there are many similar cases with similar decisions, it is easier to predict the judgement of another similar case. Whenever several very diverse issues are grouped under a single article of the ECHR, the performance is likely lower. This is likely the cause of the relatively low performance for Article 8 (right to respect for private and family life), which covers a large range of cases. The same can be said for Article 10 (right to freedom of expression), as the platforms for expression are growing in variety, especially since the spread of the Internet.
To investigate the prediction errors made by each system, we focus on Article 13, which has the highest accuracy score. Our approach was to first list the n-grams having the top-100 tf-idf scores for the incorrectly classified documents (separately for violations classified as non-violations and vice versa). We then only included the n-grams occurring in at least three incorrectly classified documents of each of the two types. For the correctly identified documents, we did the same (again for two types: violations correctly classified and non-violations correctly classified) and then looked at the overlap of the n-grams in the four lists. While those lists contained very different words and phrases, we were also able to observe some general tendencies. For instance, phrases related to prison (e.g., ‘the prison’, ‘prisoner’, etc.) generally appeared in cases with no violation. Consequently, violation cases which do contain these words are likely to be incorrectly classified (as non-violation). Similarly, words related to prosecutors (e.g., ‘public prosecutor’, ‘military prosecutor’, ‘the prosecutor’) are found more often in cases with a violation and therefore non-violation cases with such phrases may be mislabelled. Table 8 shows a subjective selection of similar-behaving words.
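The listing below gives a rough sketch (not the code we used) of this procedure: for each document in a group, collect the n-grams with the highest tf-idf scores, and keep those that recur in at least three documents of that group. Here X_tfidf is assumed to be a sparse tf-idf matrix and feature_names the corresponding n-grams:

```python
from collections import Counter
import numpy as np

def frequent_top_ngrams(X_tfidf, feature_names, doc_indices, top_k=100, min_docs=3):
    """N-grams among the top_k tf-idf scores of at least min_docs of the given documents."""
    counts = Counter()
    for i in doc_indices:
        row = X_tfidf[i].toarray().ravel()
        top = np.argsort(row)[-top_k:]
        counts.update(feature_names[j] for j in top if row[j] > 0)
    return {ngram for ngram, c in counts.items() if c >= min_docs}

# Compare the resulting sets for correctly and incorrectly classified documents
# of each class to find similar-behaving n-grams such as those in Table 8.
```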
Table 8 Comparison of selected n-grams within top-100 tf-idf scores for correctly predicted and mislabelled documents for Article 13
Note that the error analysis remains rather speculative. It is impossible to pinpoint what exactly makes the largest impact on the prediction, as the decision for a document is based on all n-grams in the document.
In the future, more sophisticated methods that include semantic analysis should be used, not only to predict the decisions, but also to identify the factors behind the choices the machine learning algorithm makes.
Experiment 2: predicting the future
Set-up
In the first experiment the test set was simply sampled randomly, without considering the year of the cases. In this section we assess how well we are able to predict future cases, by dividing the cases used for training and testing on the basis of the year of the case. Such an approach has two advantages. The first is that it results in a more realistic setting, as there is no practical use in predicting the outcome of a case for which the actual outcome is already known. The second advantage is that times are changing, and this affects the law as well. For example, consider Article 8 of the ECHR. Its role is to protect citizens’ private life, which includes, for instance, their correspondence. However, correspondence 40 years ago looked very different from that of today. This also suggests that using cases from the past to predict the outcome of cases in the future might result in a lower, but more realistic, performance than the results reported in Table 5. For this reason, we have set up an additional experiment to check whether this is indeed the case and how sensitive our system is to this change. Due to the more specific data requirements of this experiment, we have only considered the datasets with the largest numbers of cases (i.e. Articles 3, 6, and 8) and have divided them into smaller groups on the basis of the year of the cases. Specifically, we evaluate the performance on cases from either 2014–2015 or 2016–2017, while we use cases up to 2013 for training. Because violation and non-violation cases were not evenly distributed between the periods, we had to balance them again. Where necessary, we used additional cases from the ‘violations’ test set (used in the previous experiment) to add more violation cases to particular periods. The final distribution of the cases over these periods can be found in Table 9.
We have performed the same grid-search of the parameters of tf-idf and SVM on the new training data as we did in the first experiment. We did not opt to use the same parameters, as these were tailored to predicting mixed-year cases. Consequently, we performed the parameter tuning only on the data up to and including 2013.
The two periods are set up in such a way that we may evaluate the performance for predicting the outcome of cases that follow directly after the ones we train on, versus those which follow later. In the latter case, there is a gap in time between the training period and the testing period. Additionally, we have conducted an experiment with a 10-year gap between training and testing. In this case, we have trained the model on cases up to 2005 and evaluated the performance using the test set of 2016–2017.
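A minimal sketch of such a temporal split is shown below; it assumes each case is stored as a dictionary with a 'year' field and is an illustration rather than our actual preprocessing code:

```python
def temporal_split(cases, train_until=2013, test_years=(2014, 2015)):
    """Split cases into a training set up to train_until and a test set for test_years."""
    train = [c for c in cases if c["year"] <= train_until]
    test = [c for c in cases if test_years[0] <= c["year"] <= test_years[1]]
    return train, test

# train, test_2014_2015 = temporal_split(cases, 2013, (2014, 2015))
# train, test_2016_2017 = temporal_split(cases, 2013, (2016, 2017))
# train_2005, test_gap = temporal_split(cases, 2005, (2016, 2017))  # 10-year gap
```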
In order to be able to interpret the results better, we conducted one additional experiment. For Experiment 1* we reduced the training data from Experiment 1 to a random sample with a size equal to the number of cases available for training in Experiment 2 (i.e. 356 cases for Article 3, 746 cases for Article 6, etc.), but with all time periods mixed together. We compare the cross-validation results on this dataset (Experiment 1*) to the results from the 2014–2015 and 2016–2017 periods in this experiment. The reason we conducted this additional experiment is that it allows us to control for the size of the training data set.
Results
As we can see from Table 10, training on one period and predicting for another is harder than for a random selection of cases (as was done in Experiment 1). We can also observe that the amount of training data does not influence the results substantially: Experiment 1 produced an average of 0.77 for the chosen articles and Experiment 1* had a result almost as high. However, testing on separate periods resulted in a much lower accuracy. This suggests that predicting future judgements is indeed a harder task, and that it gets harder as the gap between the training and testing data increases.
Table 10 Results for Experiment 2
Discussion
The results of Experiment 2 suggest that we have to take the changing times into account if we want to predict future cases. While we can predict the decisions of the years directly following the training period relatively well, performance drops when there is a larger gap between the period on which the model was trained and the period on which it was tested. This shows that a continuous integration of newly published judgements into the system is necessary in order to keep up with the changing legal world and maintain an adequate performance.
While there is a substantial drop in performance for the 10-year gap, this is likely also caused by a large reduction in training data. Due to the limit on the training period, the number of cases used as training data was reduced to 112 for Article 3 (instead of 356), 354 (instead of 754) for Article 6, and 144 (instead of 350) for Article 8. Nevertheless, the large drop for Article 8 suggests that the issues covered by this Article have evolved more in the past decade than those of the other two articles.
Importantly, while these results show that predicting judgements for future cases is possible, the performance is lower than when simply predicting decisions for random cases [such as the approach in our Experiment 1 and the approach employed by Aletras et al. (2016) and Sulea et al. (2017b)].
Experiment 3: judges
Set-up
We also wanted to experiment with a very simple model. Consequently, here we use only the names of the judges that constitute a Chamber, including the President, but excluding the Section Registrar and the Vice-Section Registrar (when present), as they do not decide cases. The surnames are extracted on the basis of the list provided by the ECtHR on its website.Footnote 16 However, ad hoc judges were not extracted, unless they appear on the same list (e.g., from a different Section), due to the unavailability of a full list of ad hoc judges for the whole period of the Court’s existence. In our extraction we did not account for any misspellings in the case documents, and consequently only correctly spelled surnames were extracted.
We have set up our prediction model in line with the previous two experiments. As input for the model to learn from, we used the surnames of the judges. In total there were 185 judges representing 47 States at different times. The number of judges per State largely depends on when the State ratified the ECHR. Given that the 9-year terms for judges were established only recently, some judges might have been part of the Court for a very long time. Some States, such as Serbia, Andorra, and Azerbaijan, have only had 2 judges, while Luxembourg has had 7 and the United Kingdom 8. Only a single judge represents each State at any given time.
We retained the same set of documents in the dataset as in Experiment 1, but provided the model only with the surnames of the judges. However, for this experiment we did not use tf-idf weighting. Instead, we represented each case by a set of binary features indicating whether or not each judge was present on the bench.
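A hedged sketch of this representation, assuming scikit-learn (the benches listed are merely illustrative, not actual case compositions), is given below:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Each case is represented by the set of judges on its bench (illustrative names only).
benches = [
    ["Garlicki", "Judge A", "Judge B"],
    ["Judge A", "Judge C", "Judge D"],
]
mlb = MultiLabelBinarizer()
X = mlb.fit_transform(benches)   # one binary column per judge: 1 if on the bench, 0 otherwise
print(mlb.classes_)
print(X)
```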
Results
Using the same approach as illustrated in Sect. 5.1, we obtained the results shown in Table 11. In addition, Figs. 4 and 5 show the weights determined by the machine learning program for the top-20 predictors (i.e. the names of the judges) predicting the violation outcome versus the non-violation outcome.
Table 11 Tenfold cross-validation and test set results for Experiment 3
Discussion
While one may not know in advance which judges are going to assess a particular case, these results show that the decision is influenced to a large extent by the judges in the Chamber.
In this experiment we did not consider how each judge voted in each case, but only what the final decision on the case was. Consequently, it is important to note that, while some judges may be strongly associated with cases which were judged to be violations (or non-violations), this does not mean that they always rule in favour of a violation when it comes to a particular article of the ECHR. It simply means that this judge is more often in a Chamber which voted for a violation, irrespective of the judge’s own opinion.
Importantly, judges have different weights depending on the article we are considering. For example, the Polish judge Lech Garlicki is frequently associated with a ‘non-violation’ of Article 13, but for Article 14 this judge is more often associated with a ‘violation’. This is consistent with the numbers in our training data: Garlicki was 36 times in a Chamber that voted for a non-violation of Article 14 and 34 times in a Chamber that voted for a violation, whereas for Article 13 he was only 6 times in a Chamber that voted for a violation and 38 times in one that voted for a non-violation.
It is interesting to look at the results for the test set in this experiment. While the average results are very similar to the cross-validation results, the scores for particular articles are very high. For instance, when predicting outcomes for Article 13 (right to an effective remedy), the names of the judges are enough to obtain a correct judgement for 79% of the violation cases. For Article 14 (prohibition of discrimination) the number is also very high: 73%.
While the average results are lower than using the n-grams, it is clear that the identity of the judges is still a useful predictor, given that the performance is higher than the (random guess) performance of 50%.