1 Introduction

Case law has long influenced the protection of animals. However, judgments concerned with animal protection have typically been assigned to established areas of legal practice such as veterinary negligence, defamation, criminal, regulatory and public law. Indeed, the recognition of animal protection law as a distinct area of professional legal practice is a relatively new phenomenon. This is highlighted by the recent formation of the UK’s first dedicated animal protection law firm, Advocates for Animals (founded in 2019: Advocates for Animals, n.d.). The nascency of animal protection law as a distinct practice area means that there has been no publicly available repository of UK animal protection law judgments. Without such a repository, those seeking to identify animal protection law judgments might have to assess the relevance of each individual judgment found through keyword searches. Such an approach is time-consuming, prone to human error and potentially hindered by flawed search tools (see Sect. 3.1). We therefore use computational techniques to create an initial repository of judgments meeting our adopted definition of animal protection case law. Under this definition, an animal protection law judgment is one that substantially concerns, relates to, or affects the welfare or protection of one or more animals (following Overcash, 2012).

Beyond the creation of a repository, this research makes several subsidiary contributions. The suitability of different machine learning (ML, see Sect. 2) models to practice area classification is considered by comparing their performance (against one another and a baseline indicative of current judgment searching practice). Criticism of the limited interpretability of ML in law is addressed by exploring the most influential features in judgment classification. Further, this research is intended to promote ML understanding among non-technical legal researchers and practitioners through the employment of a straightforward structure.

The remainder of the paper is structured as follows. In Sect. 2, we describe the concepts underpinning this work, identify related work and consider why the application of ML in law remains contentious. Section 3 presents our repository creation process, with dedicated subsections explaining how we:

  • Searched judgments for those containing a keyword,

  • Labelled a selection of these judgments,

  • Trained models using these data,

  • Calculated results and chose the final model,

  • Evaluated the final model,

  • Used the model to classify unlabelled judgments.

Repository creation information is followed by a discussion of findings, limitations and potential future work in Sect. 4. Lastly, Sect. 5 concludes the paper by summarising the creation process and contents of our law repository.

2 Background

This paper details the creation of an animal protection law repository using ML. ML is the practice of using algorithmic methods to learn from data and make predictions. It is often used as a tool for natural language processing (NLP), which is concerned with using computational techniques to process and analyse human language data. Both ML and NLP have been employed in the legal space for some time (Nay, 2018). Westlaw, for example, has offered legal search tools drawing on both techniques since the 1990s (Thomson Reuters, n.d.). As the availability of law-related documents has grown, modelling and feature extraction approaches originating from the fields of ML and NLP (see Sect. 3) have become ever more important to legal research.

The (increasing) importance of ML and NLP techniques is reflected in their application to a broad range of legal tasks. These tasks include computing judgment similarity (Mandal et al., 2021), predicting violations of the European Convention on Human Rights (Aletras et al., 2016; Medvedeva et al., 2020) and gauging the influence of demographic characteristics on judicial decision making (Rachlinski & Wistrich, 2017). The application of ML and NLP has even extended to the task underlying this research: legal text classification. Prior law-centred classification endeavours have involved the identification of: documents ‘relevant’ and ‘not relevant’ to legal claims during e-discovery processes (discussed by Ashley, 2017); whether a statutory provision applies to a legal issue (Savelka et al., 2015); and, which well-established practice area a court judgment falls into (Lei et al., 2017; de Araujo et al., 2020).Footnote 1 In creating a partially automatically constructed judgment repository for a recently recognised practice area, this research therefore widens the range of legal classification tasks addressed through ML and NLP.

There are patterns in how previous practice area classification efforts have both represented legal documents and used these representations during modelling. Judgments have typically been represented using methods based on term frequency. De Araujo and colleagues (2020) classified Brazilian lawsuits by established ‘themes’ using a tuned term frequency—inverse document frequency approach (TF-IDF: see Sect. 3.3.1). Similarly, Lei and colleagues (2017) used a TF-IDF method as their sole document representation method when categorising Chinese judgments by ‘industry divisions’.Footnote 2 We find no evidence that neural representation approaches such as Bidirectional Encoder Representations from Transformers (BERT) have been used for judgment practice area categorisation. However, the viability of such representation approaches is suggested by their employment in other legal document classification work. Undavia et al. (2018) found the automatic classification of legal court opinions into Supreme Court Database categories to be best achieved using a neural representation method (in conjunction with a convolutional neural network model). Embeddings from a BERT-based model were found to provide an effective foundation for multi-label classification of a dataset of legal opinions (Song et al., 2021). Additionally, Longformer achieved state-of-the-art performance on a case outcome prediction task (Bhambhoria, Dahan & Zhu, 2021).Footnote 3

Both Lei and colleagues (2017) and de Araujo and colleagues (2020) trialled multiple ML models when classifying court judgments using TF-IDF representations. In the former, this trialling showed a linear support vector machine (SVM: see Sect. 3.3.2) model to outperform naive Bayes (NB), decision tree and random forest models. In the latter, the results achieved by an XGBoost approach surpassed those from SVM and NB approaches. When using a smaller dataset, however, XGBoost and SVM results were comparable (de Araujo et al., 2020). To our knowledge, there exists no published research in which multi-layer perceptrons (MLPs, see Sect. 3.3.2) have been applied to judgment practice area tasks. Yet, MLPs have been used elsewhere in legal research to, for example, recognise named entities (such as the date of judgment) within court judgments (Vardhan et al., 2020). MLPs have also aided a diverse range of document classification efforts outside the legal sector, spanning from the categorisation of German business letters (Wenzel et al., 1998) to the identification of the authorship of plays (Merriam & Matthews, 1994).

While ML has been used in various legal settings, its application remains contentious. Three key reasons for this contentiousness are briefly considered, each of which is pertinent to our research. Firstly, methods involving ML are considered backward-looking in relation to legal reasoning (Markou and Deakin, 2020). ‘ML’s effectiveness is diminished in direct relation to the novelty of the cases it must process’ and, connectedly, the rate of change in the context where it is applied (Markou and Deakin, 2020, p. 63). This concern is clearly consequential to certain legal tasks, such as the prediction of future European Court of Human Rights case outcomes (see Medvedeva et al., 2020). The backward-looking nature of ML might also affect our ML endeavours. This is because our selected model could conceivably be applied to judgments made in the years following 2020. However, the primary intention of this research was inherently retrospective—to create a repository of existing judgments between 2000 and 2020. As such, we ultimately specified our model in a manner that prioritised predictive performance on past judgments over future judgments (Sect. 3.5).

A second concern about the use of ML in law stems from the potential threat that these methods pose to the autonomy of judicial processes. This threat would be clearly manifested should ML methods be used to conduct adjudication without human participation (Markou & Deakin, 2020). In such an instance, ML approaches would be contributing directly to law creation. In contrast with a model that carries out adjudication, the ML models in our paper are merely intended to aid the identification of animal protection law judgments. They could not, therefore, be seen to contribute to the law directly. Yet, the case law repository created through machine-based approaches might still contribute to the indirect creation of law (see further, Burri, 2017).Footnote 4 This could occur if, for example, lawyers’ arguments were influenced by animal protection judgments that they would not have identified without a repository made using ML predictions. Still, the tangential nature of such a contribution to law creation means that the models considered in this paper need not be considered to pose a serious threat to judicial autonomy. Indeed, lawyers concerned with animal protection law are likely already using machine-based (albeit not ML-based) approaches to identify judgments (such as the ‘Case Law Search’ tool discussed below: Sect. 3.1).

Thirdly, arguments against the employment of ML in law have focused on the limited extent to which these techniques permit human understanding. Deakin and Markou (forthcoming, p. 15) contended that the findings of ML models cannot ‘be adequately explained using the types of arguments which lawyers are accustomed to making’. However, this research makes efforts to address the comprehension-based concerns voiced in the law and technology literature. The process through which ML techniques are applied is detailed in a step-by-step manner. Papers with such a format have appeared infrequently, with most work instead ‘focused on the implications of the running model’ (Lehr & Ohm, 2017, p. 655). Further, the features most influential in determining whether or not a judgment is classified as concerned with animal protection law by our selected model are presented graphically and considered qualitatively (Sect. 3.5). By providing intelligible explanations of the manner in which our final model classifies judgments and the process through which it was created, it is hoped that this paper contributes to efforts to demystify the use of NLP and ML in law (Lehr & Ohm, 2017).

3 Repository creation

3.1 Searching judgments

Creation of an animal law judgment repository began with the identification of court judgments containing the word ‘animal’. Through discussion with the domain expert, who felt it highly improbable that any animal protection law judgment would omit this term (Sect. 4), we adopted the assumption that ‘animal’ would be present in every such judgment. In fact, it is feasible that prior attempts to identify relevant judgments would have treated any judgment containing ‘animal’ as relating to animal protection law until proven otherwise. We therefore use this strategy as a baseline measure for comparison against ML models (Sect. 3.4).Footnote 5

The judgments searched were all those available from the British and Irish Legal Information Institute (BAILII) made by the Privy Council, House of Lords, Supreme Court and upper England and Wales courts between 2000 and 2020.Footnote 6 BAILII was used as the basis for this search as it provides the most extensive collection of British legal materials freely available online (BAILII, n.d.). Searching involved opening each judgment and recording its URL when ‘animal’ was found in the text. Implementation of the search in late December 2020 found that 1637 of the 55,202 judgments by upper courts from January 2000 to December 2020 contained the word ‘animal’.
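Although the scripts used for this search are not reproduced in the paper, the judgment-by-judgment approach can be illustrated with a short Python sketch. The function names, example URL and output file below are illustrative assumptions rather than the code actually used.

```python
import csv
import requests

def contains_keyword(url: str, keyword: str = "animal") -> bool:
    """Download one judgment page and check whether it mentions the keyword."""
    html = requests.get(url, timeout=30).text
    return keyword.lower() in html.lower()

def search_judgments(judgment_urls, out_path: str = "animal_judgment_urls.csv") -> None:
    """Record the URL of every judgment that contains the keyword."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["url"])
        for url in judgment_urls:
            if contains_keyword(url):
                writer.writerow([url])

# Hypothetical usage over a list of BAILII judgment URLs:
# search_judgments(["https://www.bailii.org/ew/cases/EWHC/Admin/2020/1.html"])
```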

Those judgments that contained ‘animal’ were typically longer than those that did not. The median word length for judgments with ‘animal’ was 10,785, while those without had a median of 5869. This corresponds with findings on the number of sentences in judgments containing ‘animal’. Judgments containing ‘animal’ had a median of 420 sentences, while those without had a median of 231. These statistics are presented below, alongside minimum, maximum, 5th percentile and 95th percentile values for both the number of words and sentences across each group (Table 1). The difference in length is potentially unsurprising: the likelihood of any given term occurring in a document increases with document length.

Table 1 Descriptive statistics for all judgments

An attempt was made to streamline the search for relevant judgments by using BAILII’s inbuilt ‘Case Law Search’ tool. This tool appeared as if it should be able to identify judgments containing a user-specified word. However, trialling demonstrated that the ‘Case Law Search’ misreported the number of judgments identified, reported certain judgments twice, picked up judgments that did not contain ‘animal’ and missed other judgments that did. Specifically, a BAILII search of the same courts over the same period as used previously claimed to provide 1810 judgments containing ‘animal’, but actually returned a list of 1809 judgments, of which 1800 were unique and only 1568 actually contained the term ‘animal’. What is more, BAILII’s Boolean search tool did not provide a count of how many judgments were searched. Using the ‘Case Law Search’ would therefore also have obstructed the identification of the proportion of judgments that contained the term ‘animal’ and were concerned with animal protection law (Sect. 3.6). The identification of these limitations suggested that the judgment-by-judgment search method initially employed was better suited to creating an animal protection law repository. Additionally, the shortcomings of BAILII’s ‘Case Law Search’ mean that our repository of 1637 judgments is a more precise collection of judgments containing ‘animal’ than could be created through BAILII alone.

3.2 Labelling judgments

A total of 500 judgments were randomly sampled for human labelling from the 1637 judgments found to contain ‘animal’. Labelling was carried out by the domain expert alone, following guidance written by them and the lead author (available here: https://github.com/JoeMarkWatson/animal_law_classifier/blob/main/animal_protection_law_labelling_guidance.docx). After human labelling was completed, stratified random sampling was used to create a training set of 400 labelled judgments and a testing set of 100 labelled judgments, each with the same proportion of positively-tagged judgments. Accordingly, the 400 training set judgments included 66 concerning animal protection law, while the 100 test set judgments included 17 (after the correction of one human labelling error that did not affect stratification: Sect. 3.5.2). Word and sentence length information is provided for those judgments containing ‘animal’ that: were not labelled; were labelled; were labelled and assigned to the training set; and, were labelled and assigned to the test set (Table 2).

Table 2 Descriptive statistics for judgments containing ‘animal’
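The stratified train/test split described above can be sketched with scikit-learn; the file name, label column and random seed are illustrative assumptions, not the project’s actual code.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file holding the 500 labelled judgments with a binary label column.
labelled = pd.read_csv("labelled_judgments.csv")

train_df, test_df = train_test_split(
    labelled,
    test_size=100,                               # 400 training / 100 test judgments
    stratify=labelled["animal_protection_law"],  # preserve the positive-class proportion
    random_state=0,                              # illustrative seed
)
```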

3.3 Model training

Each implemented ML approach depended on both feature extraction and modelling. Feature extraction involved transforming the text of each judgment into a format suitable for use in a ML model. Modelling entailed using the extracted features for each judgment, combined with their label, for ML training.

3.3.1 Extracting features

Five feature extraction methods were used: TF-IDF vectors; USE (Cer et al., 2018); sentence-BERT (s-BERT: Reimers & Gurevych, 2019); Longformer (Beltagy, Peters & Cohan, 2020); and BigBird (Zaheer et al., 2020) embeddings. Each method takes the text of a judgment as input and returns a vector representation of that text. The first 200 words of each judgment were excluded, as this portion typically contained judges’ and parties’ names that, if employed in a classification model, could lead to overfitting. Explicitly identifying and removing this information would have been preferable, yet such an approach was obstructed by the (variable) structure of judgments on BAILII (see Sect. 4).

TF-IDF is a well-established approach in the field of NLP (Jones, 1972) that has been frequently applied to judgment classification tasks (see Sect. 2). The approach is based on a bag-of-words assumption, which takes no account of the relationship between terms. TF-IDF is defined as follows (Eq. 1):

$$TF\text{-}IDF(t, d) = TF(t, d) \times IDF(t)$$
(1)

The TF-IDF value for a term t in a document d is a function of its frequency within that document (term frequency, TF) and its overall frequency in the corpus (as inverse document frequency, IDF). IDF is defined in Eq. 2:

$$IDF(t) = \log\left[\frac{1 + N}{1 + DF(t)}\right] + 1$$
(2)

In the above, N is the number of documents in the corpus, and DF(t) is the number of these that contain t.Footnote 7

Various pre-processing methods were tested when creating TF-IDF vectors; the use or not of each method was controlled via a parameter. One such parameter controlled whether terms were lemmatised or not before term and document frequencies were calculated. Lemmatisation involves reducing terms that have been inflected in accordance with tense or number, inter alia, to a root term (lemma). For example, the lemmas of ‘culling’ and ‘culled’ are both ‘cull’. Additionally, the terms themselves could be single words (unigrams), or multiple contiguous words grouped to act as a single term (n-grams; bigrams when two contiguous words are used). Minimum and maximum term document frequency thresholds were also set, with the terms appearing above or below a certain frequency excluded (as these might not aid classification efforts). Lastly, the vector size was controlled by only counting the top n most frequent terms. All parameters were optimised during training (see Sect. 3.3.3).

When the TF-IDF feature extraction method was used, text underwent additional pre-processing before creating representations. This additional pre-processing was only carried out when creating TF-IDF vectors, as neither USE nor s-BERT embeddings directly represent individual terms in the same way as TF-IDF vectors. The intention of this TF-IDF pre-processing step was to improve the generalisability of the model to data outside of the training set. Pre-processing therefore involved removing aspects of the text which were assumed not to be indicative of a judgment’s classification, but could nonetheless be used by the model because of a chance correlation with one of the classes. This entailed the following steps (a minimal code sketch is provided after the list):

  • Removing URLs and HTML tags from the text,

  • Transforming words to lowercase,

  • Deleting digits and punctuation from the text,

  • Retaining only English words.
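A minimal sketch of these pre-processing steps is given below. The regular expressions and the use of NLTK’s ‘words’ corpus as the English word list are illustrative assumptions; the original implementation may differ in detail.

```python
import re
import string
from nltk.corpus import words as nltk_words  # requires nltk.download('words')

ENGLISH_WORDS = {w.lower() for w in nltk_words.words()}

def preprocess_for_tfidf(text: str) -> str:
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"<[^>]+>", " ", text)        # remove HTML tags
    text = text.lower()                         # transform words to lowercase
    text = re.sub(r"\d+", " ", text)            # delete digits
    text = text.translate(str.maketrans("", "", string.punctuation))  # delete punctuation
    tokens = [t for t in text.split() if t in ENGLISH_WORDS]          # retain only English words
    return " ".join(tokens)
```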

While transformer-based embedding approaches have been used previously for numerous legal tasks (outside judgment practice area classification: Sect. 2), these remain relatively new developments in NLP. The USE and s-BERT sentence-embedding approaches were presented in 2018 by Cer and colleagues and in 2019 by Reimers and Gurevych, respectively (following the previous development of BERT: Devlin et al., 2019). The Longformer and BigBird transformer approaches to long text sequence embedding were developed more recently still (Beltagy, Peters & Cohan, 2020; and Zaheer et al., 2020, respectively). These models have all been trained on a large amount of text; by enhancing the investigation of newly available data through models trained on previously available data, this research employed a form of transfer learning (Devlin et al., 2019). This endowed our judgment classifier with natural language understanding derived from text data outside the training set judgments. As a result, embeddings-based models might be able to function adequately with less labelled data than that required by more traditional approaches (like TF-IDF: Asgari-Chenaghlu et al., 2020). Achieving acceptable performance with limited training data was important in our case, as human labelling was both time consuming (taking approximately 10 hours per 100 judgments) and could conceivably have been expensive (see further, Muller et al., 2021).Footnote 8

There are multiple types of USE, s-BERT, Longformer and BigBird models. In this research, the large USE, base s-BERT, Longformer and BigBird models were used, returning 512-dimensional embeddings for USE and 768-dimensional embeddings for all other models. As all models are pre-trained, no hyperparameters need to be tuned. Longformer and BigBird were used because they are designed to embed text sequences longer than the single sentences for which USE and s-BERT are designed. One or more embeddings are computed for each document using each model: for USE and s-BERT by embedding each sentence, and for Longformer and BigBird by splitting judgments into chunks up to the maximum length allowed by the models and embedding these.

As it is desirable for a judgment to be represented by only one embedding, we consider two methods for obtaining these from the (potentially) multiple that exist. The first is to select the embedding of the first sentence or chunk, and the second is to take a mean average over all embeddings. Prior work has found averaging sentence embeddings to work well for document retrieval (Yang et al, 2019), aspect-extraction (Verma et al, 2021), and fake news identification (Slovikovskaya & Attardi, 2020). The second approach is therefore taken for the sentence-based models. Another reason for this choice is that it is unlikely that the first sentence of a judgment would be sufficient to identify whether the case concerned animal welfare according to our definition. Two steps were taken to limit the computational demands of the sentence embedding process. First, sentence embeddings were not created in the rare case when a sentence was over 1000 words. Second, 5000 sentences were randomly sampled for embedding from any judgment that exceeded 5000 sentences in length.
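The sentence-embedding and mean-pooling step can be sketched as follows for the s-BERT system; the model name, sentence splitter and sampling seed are assumptions made for illustration, and the same pattern applies to USE with a different encoder.

```python
import numpy as np
from nltk.tokenize import sent_tokenize
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("bert-base-nli-mean-tokens")  # an assumed base s-BERT model

def judgment_embedding(text: str, max_sentences: int = 5000, max_words: int = 1000) -> np.ndarray:
    # Drop the rare sentences longer than 1000 words.
    sentences = [s for s in sent_tokenize(text) if len(s.split()) <= max_words]
    # Randomly sample 5000 sentences from any judgment exceeding that length.
    if len(sentences) > max_sentences:
        rng = np.random.default_rng(0)
        sentences = list(rng.choice(sentences, size=max_sentences, replace=False))
    sentence_vectors = model.encode(sentences)   # one embedding per sentence
    return sentence_vectors.mean(axis=0)         # mean over sentences gives one judgment vector
```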

Both approaches were trialled for the document-level models. It was found that using the first chunk gave better results than averaging all embeddings. This is likely due to the large amount of text represented by each embedding. Indeed, 49 of the 500 labelled judgments could be entirely represented by one embedding.

3.3.2 Modelling

This research trialled two modelling approaches. One was a linear SVM (Cortes & Vapnik, 1995) specified using the scikit-learn library (Pedregosa et al., 2011). The other was a MLP (Rumelhart et al., 1986) created through the scikit-learn wrapper for Keras (Chollet, 2015). The trialling of a linear SVM approach follows their strong performance in previous judgment practice area classification tasks (Sect. 2). Linear SVMs output a single weight per input feature based on ‘a boundary in a vector space between the positive and negative instances of a category or class that is maximally distant from all of the instances’ (Ashley, 2017, p. 251). These models are trained to maximise the margin of the decision boundary and classify as many training points correctly as possible (Boser, Guyon & Vapnik, 1992). These two training aims can conflict, with the extent to which priority is given to either objective controlled through the regularisation hyperparameter, C (see below).

MLPs have long been applied to a broad range of document classification tasks, although this has hitherto not extended to practice area classification (Sect. 2). In contrast with the SVM, which separates data with a linear boundary, the MLP includes an activation function which makes a nonlinear boundary possible (see Sect. 3.3.3). This function is applied in a so-called ‘hidden layer’ between the input layer, where the features are provided to the model, and the output layer, where the classification is given. If our judgment classification task had such complexity that a linear boundary could not adequately separate classes, the application of a nonlinear boundary could increase model performance. However, it is also possible that using a non-linear multi-layer method could lead to overfitting that limits generalisability to new data.

Creating MLP models involved setting various hyperparameters. One of these controlled whether dropout (Srivastava et al., 2014) was employed. Dropout is a regularisation technique that limits overfitting by randomly dropping neurons and their connections during training, and has been shown to improve ML performance on tasks including document classification (Maaten et al., 2013; Srivastava et al., 2014). Where a non-zero dropout value was used, it was applied to both the input and hidden layers and used in conjunction with max-norm regularisation (set at a value of 3, as using dropout with max-norm regularisation is likely to produce better results than dropout alone: Srivastava et al., 2014). Hyperparameter settings also affected the model’s learning rate and number of epochs. The learning rate controls the magnitude of changes made to model weights during each update, with higher learning rates producing larger changes. An epoch denotes a full pass through the training data. These two settings are interdependent, as models trained with low learning rates will generally require more epochs to train, and vice versa.

3.3.3 Implementation

Ten ML systems were established by combining the two modelling approaches (SVM and MLP) and five feature extraction approaches (TF-IDF, USE, s-BERT, Longformer and BigBird). The complexity of these systems differs in accordance with two factors. Firstly, a MLP is more complex than a SVM. SVMs are only able to separate data with a linear boundary, while MLPs can use a non-linear boundary. Secondly, TF-IDF embeddings are simpler than USE, s-BERT, Longformer and BigBird embeddings. TF-IDF embeddings are derived from word frequency counts, whereas embeddings approaches use large neural networks trained on external data.

A grid search was performed to identify the optimal hyperparameters for each ML approach, by choosing the set of hyperparameters that gave the highest average macro-F1 score in five-fold cross validation. Stratified five-fold cross validation splits the dataset into five equally sized sets, or ‘folds’, each with the same proportion of judgments related to animal protection law. In each iteration, four folds are used together for training while the remaining one is used for validation. This process is depicted below (see Fig. 1).

Fig. 1 Five-fold cross validation. Note: in each row, folds used for training are shown in yellow and the validation fold in blue. (Color figure online)

We considered the optimal hyperparameters to be those that achieved the greatest mean macro-F1 score across validation folds. Macro-F1 is a simple arithmetic mean of per-category F1 scores. F1 is given by the following equation (Eq. 3):

$$F1 = \frac{2(P \times R)}{P + R}$$
(3)

In the above formula, the F1 score is the harmonic mean of precision (P) and recall (R). Precision and recall are defined below (Eqs. 4 and 5).

$$R = \frac{TP}{TP + FN}$$
(4)
$$P = \frac{TP}{TP + FP}$$
(5)

In Eqs. 4 and 5, TP refers to a true positive, FN to a false negative and FP to a false positive classification. Macro-averaged F1 is preferable to accuracy when there is a class imbalance, as it better reflects poor performance on the minority class.
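For illustration, macro-F1 can be computed directly with scikit-learn; the labels below are dummy values rather than project data.

```python
from sklearn.metrics import f1_score

y_true = [1, 0, 0, 1, 0, 0]   # dummy gold labels (1 = animal protection law)
y_pred = [1, 0, 1, 0, 0, 0]   # dummy predictions
macro_f1 = f1_score(y_true, y_pred, average="macro")  # arithmetic mean of per-class F1 scores
```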

For all SVM systems, the following hyperparameters were optimised:

  • Loss function: squared hinge or hinge,

  • Regularisation value, C: 0.1, 1, 2, 5, or 10.Footnote 9

For MLP systems, different hyperparameters were tuned:

  • Number of neurons in the hidden layer: NF/16, NF/8, NF/4, and NF/2,Footnote 10

  • Dropout on both the input layer and hidden layer: 0, 0.2,

  • Learning rate: 0.1, 0.01 and 0.001,

  • Number of epochs: 20, 50, 100 and 200.

For MLP systems, ‘NF’ in the first bullet point refers to the number of input features; for example, in USE embeddings this was 512. All models employed one hidden layer with ReLU activation and an output layer of one neuron with sigmoid activation. Each MLP system was also trained using the Adam optimiser (Kingma & Ba, 2017) and binary cross-entropy loss with a batch size of 32.
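The MLP architecture described above can be sketched in Keras as follows; the hyperparameter values shown are illustrative defaults rather than the tuned settings reported later.

```python
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.constraints import MaxNorm

def build_mlp(n_features: int, hidden_divisor: int = 4,
              dropout: float = 0.2, learning_rate: float = 0.001) -> keras.Model:
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        layers.Dropout(dropout),                         # dropout on the input layer
        layers.Dense(n_features // hidden_divisor,
                     activation="relu",
                     kernel_constraint=MaxNorm(3)),      # hidden layer with max-norm set at 3
        layers.Dropout(dropout),                         # dropout on the hidden layer
        layers.Dense(1, activation="sigmoid"),           # single-neuron output layer
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
                  loss="binary_crossentropy")
    return model

# e.g. model = build_mlp(n_features=512)                 # USE embeddings as input features
#      model.fit(X_train, y_train, epochs=100, batch_size=32)
```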

In SVM or MLP systems that used TF-IDF vectors, we also optimised the following feature extraction hyperparameters (a grid-search sketch combining these with the SVM hyperparameters is provided after the list):

  • Term lemmatisation: using lemmatised or unlemmatised terms,

  • Single terms or multiple contiguous terms: employing unigrams, or unigrams and bigrams,

  • Upper document frequency (DF) threshold: ignoring terms featuring in over 60 percent, 70 percent, or 80 percent of all documents,

  • Lower DF threshold: ignoring terms featuring in less than one, two or three documents,

  • Vector size: Employing a maximum number of TF-IDF features of 500 or 1000.
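As an illustration of how this search could be implemented, the sketch below combines the TF-IDF and SVM hyperparameter grids in a scikit-learn pipeline (lemmatisation toggling is omitted for brevity, and variable names are illustrative).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("svm", LinearSVC())])

param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],    # unigrams, or unigrams and bigrams
    "tfidf__max_df": [0.6, 0.7, 0.8],          # upper document-frequency threshold
    "tfidf__min_df": [1, 2, 3],                # lower document-frequency threshold
    "tfidf__max_features": [500, 1000],        # vector size
    "svm__loss": ["squared_hinge", "hinge"],   # SVM loss function
    "svm__C": [0.1, 1, 2, 5, 10],              # regularisation value
}

search = GridSearchCV(pipeline, param_grid, scoring="f1_macro",
                      cv=StratifiedKFold(n_splits=5))   # stratified five-fold cross validation
# search.fit(train_texts, train_labels)
# best_hyperparameters, best_macro_f1 = search.best_params_, search.best_score_
```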

3.4 Results and final model selection

The optimal hyperparameter combinations for each system are detailed below (Tables 3 and 4), with the macro-F1 values achieved by these combinations provided later (alongside test set results: Table 5).

Table 3 Optimal hyperparameter combinations for TF-IDF systems
Table 4 Optimal hyperparameter combinations for embeddings systems
Table 5 Comparison of system macro-F1

After establishing the best-performing hyperparameters for each system, we compared these models against a baseline measure and one another. In the baseline measure, all 100 judgments in the test set were assumed to concern animal protection law (in a manner intended to mimic existing attempts to identify animal protection law judgments: Sect. 3.1). For the systems, the versions that achieved the best average score across validation folds were used to predict the classifications of the test set judgments before macro-F1 scores were recorded (Table 5). While average validation fold and test set results are provided together, only the test set judgments had not been used for any part of model training. As such, the macro-F1 test results provide a less biased estimate of model performance.

All ML-based systems significantly outperformed the baseline on the test set (p = 0.001).Footnote 11 Amongst these ML systems, the best-performing was the SVM model with TF-IDF features. The difference between the TF-IDF SVM and other systems was significant (p = 0.05) except in cases where USE embeddings were employed. As the TF-IDF SVM outperformed the less complex baseline measure while clearly not being outperformed by any more complex system, our results suggested that it should be used to predict the classification of the remaining (unlabelled) judgments.

Before concluding system selection, we considered whether the relative testing performance of the TF-IDF SVM seemed logical. Various reasons could suggest this not to be the case. Firstly, all feature extraction methods other than TF-IDF might have allowed models to benefit from transfer learning (Sect. 3.3.1). Secondly, as TF-IDF makes a bag-of-words assumption, there is some loss of information in the document representation (Sect. 3.3.1). Thirdly, unlike a SVM,Footnote 12 a MLP can learn nonlinear decision boundaries, which is plausibly necessary to model the intricacies of judgment classification.

However, we identified a number of factors that did support the relative strength of the TF-IDF SVM approach. The tuning process for the TF-IDF SVM optimised feature extraction hyperparameters, in contrast with the transformers, which were not tuned for the task. Additionally, it is quite possible that useful information was lost in the creation of our embeddings features. The averaged USE and s-BERT embeddings were simple means of the many sentence vectors created for each judgment. Similarly, the Longformer and BigBird embeddings did not take account of the latter part of longer judgments. The superiority of the TF-IDF SVM over the TF-IDF MLP could also be due to SVMs being relatively insensitive to class imbalance compared with MLPs (Japkowicz & Stephen, 2002) and to possible MLP overfitting.Footnote 13 Moreover, the finding that a TF-IDF feature extraction technique and SVM architecture performed strongly on a judgment classification task is consistent with previous literature (Sulea et al., 2017b; Lei et al., 2017). Given the existence of reasons why a TF-IDF SVM might outperform other models, and congruous findings in related work, our choice to use a TF-IDF SVM was finalised.

3.5 Investigating the chosen model

3.5.1 Considering influential features

To interpret the behaviour of the model, the features (lemmas) with the highest and lowest feature coefficients (weights) were plotted (Fig. 2). Lemmas with high coefficient values were likely to be most predictive of judgments that were concerned with animal protection law while those with low coefficient values were likely most predictive of judgments that were not. These weights provided us with encouragement that the selected model was taking relevant information into account when classifying judgments. There were very few ‘meaningless’ unigrams amongst those with the highest and lowest coefficient values (cf. Medvedeva et al., 2020, p. 255). In fact, the majority of the terms with positive coefficient values in Fig. 2 were logical predictors of animal protection law judgments. ‘Welfare’, ‘hunt’ and ‘conservation’, for example, appeared likely to correspond with the adopted definition of animal protection law (see Sect. 1; Overcash, 2012).

Fig. 2 Ten most positive and negative coefficient values assigned to model lemmas
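The coefficient inspection underlying Fig. 2 can be reproduced along the following lines, assuming `vectorizer` and `svm` are the fitted TF-IDF extractor and linear SVM (the variable names are illustrative).

```python
import numpy as np

feature_names = np.array(vectorizer.get_feature_names_out())  # the lemmas used as features
coefs = svm.coef_.ravel()                                      # one weight per lemma

order = np.argsort(coefs)
most_negative = list(zip(feature_names[order[:10]], coefs[order[:10]]))    # negative predictors
most_positive = list(zip(feature_names[order[-10:]], coefs[order[-10:]]))  # positive predictors
```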

The lemmas with negative coefficients also typically seemed rational indicators of judgments containing the term ‘animal’ that were not predominantly concerned with the welfare or protection of animals. These included ‘jury’, the presence of which reflected the fact that most criminal animal protection law judgments from 2000 to 2020 came from summary cases (which have no jury). However, the very reason why ‘jury’ could be considered a rational indicator was also a source of contention. Animal offences will soon become either-way offences, which are triable with a jury, following the enactment of the Animal Welfare (Sentencing) Act 2021 later in 2021. This could cause an increase in the use of ‘jury’ in animal protection law judgments, thereby reducing the extent to which the lemma is a useful negative predictor. The domain expert initially felt that the feature should therefore be removed. However, removing ‘jury’ reduced model performance (albeit not significantly) on the main aim of the system: creating a repository of existing animal protection law judgments. The feature was therefore kept, meaning that the system remained backward-looking (Markou and Deakin, 2020).Footnote 14

On first inspection, the domain expert also noted that a minority of features were not intuitive predictors of whether a judgment was concerned with animal protection law. These included ‘schedule’, which had the most negative coefficient value. In apparent contrast with this value, the feature could certainly have occurred in judgments concerning animal protection law. The Endangered Species (Import and Export) Act 1976, for example, contains multiple schedules of relevance to animal welfare that might have been drawn on in various animal protection law judgments. In fact, schedules within this Act are referenced in a judgment in our initial repository that was classified as animal protection law by our ML model (R v. Sissen, 2000). Yet, further domain expert examination of our judgment repository uncovered many examples of judgments containing the term ‘schedule’ that did not satisfy our animal protection law definition. These included an EWCA judgment (European Brand Trading Ltd v. HM Revenue and Customs, 2016) which referred to schedules under the Customs and Excise Management Act 1979. The domain expert therefore ultimately accepted that the presence of ‘schedule’ could well have been more indicative of a judgment not concerning animal protection law.

3.5.2 Error analysis

An error analysis was carried out on each test set judgment that was mis-predicted by the ML system. A confusion matrix shows that there were relatively few incorrectly predicted test set judgments (7 of 100: Table 6). This matrix also suggests that the system might be more susceptible to false negative errors (i.e., predicting that a judgment is not concerned with animal protection law when it actually is). Conducting human investigation into false negative errors was therefore considered imperative.

Table 6 Confusion matrix
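The confusion matrix can be produced directly from the test-set labels and predictions, for example (variable names are illustrative):

```python
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```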

Qualitative feedback provided by the domain expert showed the majority of false negative predictions to be highly marginal cases. These included a judgment involving the movement of cattle and regulations around bovine tuberculosis (Banks v. Secretary Of State For Environment, Food & Rural Affairs, 2004). While the judgment had clear implications for the protection of animals, these were primarily discussed in financial terms. Additionally, one judgment centred on slander relating to allegations of animal cruelty (Barkhuysen v. Hamilton, 2016). Given that the allegations under discussion were almost certainly false, the domain expert acknowledged that others could contend the protection of animals was never truly a central issue. In contrast with false negative errors, the domain expert felt that both false positive errors were unambiguous (First Corporate Shipping Ltd t/a Bristol Port Company v. North Somerset Council, 2001; Sienkiewicz v. South Somerset District Council, 2015). Still, it was noted that these judgments possessed terminology indicative of animal protection law (with both referencing ‘wildlife’ and the ‘environment’). Feedback on all errors is provided as an annex (‘Appendix 1’).

This error analysis also led to the identification and subsequent correction of one annotation mistake, where a judgment was initially labelled as not concerning animal protection law. The judgment in question was R (on the application of Aggregate Industries UK Ltd) v. English Nature (2002). Re-inspection showed that this judgment clearly centred on animal protection: it concerned a challenge by industry to a Department for Environment, Food and Rural Affairs (DEFRA) decision to designate a particular area as a Site of Special Scientific Interest, which had direct implications for the protection of wild birds. Upon finding that the initial human classification for this judgment was the result of a labelling mistake, the human classification was revised and test set results for all models were re-calculated.

The correction of a labelling error and re-calculation of results could be seen as undesirable. As judgments were revisited solely in instances where the selected ML model and human classifications did not match, any labelling change would only lead to an improvement in the model’s macro-F1 score. However, we felt it appropriate that reported results were adjusted for any known error.Footnote 15 Further, amending the labelling error improves the judgment repository as this includes all human-labelled judgments.

3.6 Classifying unlabelled judgments

The selected system was applied to the remaining 1137 unlabelled judgments and the results combined with the 500 labelled judgments. This gave a repository of 1637 judgments from the Privy Council, House of Lords, Supreme Court and upper England and Wales courts containing ‘animal’ and their classifications (available at: https://github.com/JoeMarkWatson/animal_law_classifier/blob/main/case_law_repository.csv). 175 (10.7%) of these judgments were classified as meeting our definition of animal protection law, including 92 found automatically. Following our assumption that the term ‘animal’ should be present in every animal protection law judgment (presented in Sect. 3.1 and further discussed in Sect. 4), this finding tentatively suggests that 0.32 percent of all 55,202 judgments from our selection of courts were substantially concerned with animal protection law.
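This final step can be sketched as follows, assuming `tfidf_svm` is the fitted pipeline selected in Sect. 3.4; the file paths and column names are illustrative assumptions rather than the project’s actual files.

```python
import pandas as pd

labelled = pd.read_csv("labelled_judgments.csv")      # 500 human-labelled judgments
unlabelled = pd.read_csv("unlabelled_judgments.csv")  # 1137 remaining judgments

# Predict a classification for every judgment not labelled by the domain expert.
unlabelled["animal_protection_law"] = tfidf_svm.predict(unlabelled["text"])

# Combine human labels and model predictions into one repository file.
repository = pd.concat([labelled, unlabelled], ignore_index=True)
repository.to_csv("case_law_repository.csv", index=False)   # 1637 judgments with classifications
```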

The proportion of animal protection law judgments among human- and ML-classified judgments differed substantially. 16.6 percent (or, 83 of 500) of the judgments labelled by the domain expert were found to concern animal protection law, while just 8.09 percent (92 of 1137) of non-labelled judgments were predicted to concern animal protection law by the ML model. It is potentially concerning that the proportion of predicted animal protection law judgments is lower than the proportion of human-labelled judgments with the same classification. This suggests that the selected ML model could have misclassified some of the non-labelled judgments and that misclassified judgments were more likely to be false negatives. This idea could be corroborated by the fact that there were more false negatives than false positives among our selected model’s test set results (Sect. 3.5.2).

However, domain expert feedback on mis-predicted test set judgments did not suggest that the selected model performed poorly on judgments that concerned animal protection law. Multiple missed animal protection law judgments were highly marginal decisions (‘Appendix 1’). The model also correctly classified the majority (12 of 17) of animal protection law judgments in the test set and made almost as many animal protection law predictions (14) as there were such judgments (17). What is more, all model predictions on the test set and unlabelled data were made using an architecture that is relatively unsusceptible to class imbalance (Japkowicz & Stephen, 2002). All this suggests that the 500 judgments randomly selected for human labelling might simply have possessed a higher proportion of animal protection law judgments than the 1137 judgments not selected. Tentative support for such a conclusion could be provided by the findings of a brief inspection of ML-classified judgments carried out by the domain expert, which suggested that both the animal protection law and the not animal protection law classifications were sensible.

The domain expert also surveyed both the ML-predicted and human-labelled judgments that were classified as animal protection law. This highlighted the broad range of issues covered by judgments concerned with animal protection law between 2000 and 2020. These included conventional public law challenges to government actions or decisions (such as a decision to proceed with the culling of badgers: R (on the application of National Farmers Union) v. Secretary of State for Environment, Food and Rural Affairs, 2020) and interpretations of the provisions of animal welfare offences (R (on the application of Highbury Poultry Farm Produce Ltd) v. Crown Prosecution Service, 2020). There were also advertising standards judgments concerning the use of animal welfare-related language in advertising materials (R (on the application of Sainsbury's Supermarkets Ltd) v. The Independent Reviewer of Advertising Standards Authority Adjudications, 2014) and planning judgments that affected endangered animals (R (on the application of Bizzy B Management Ltd) v. Stockton-On-Tees Borough Council, 2011). It is hoped that these initial remarks on the breadth of topics within the judgment repository are a precursor to further reflection by other users.

4 Discussion

Users of the animal protection law repository should be aware that it is restricted in scope. Such words of caution would be relevant to any attempt to compile UK court judgments from BAILII. This collection of legal materials contains many but not all judgments (from 2000 onwards in the High Court and above). Additionally, at the time of writing, our repository only includes judgments on BAILII that were made available online before analysis began (mid-December 2020). Indeed, our selected model might not be so successfully applied to future judgments (given anticipated changes in animal protection law judgments: Sect. 3.5.1).Footnote 16 The repository also only contains judgments in which the word ‘animal’ was used. The domain expert advised that it was highly likely that all relevant judgments would contain ‘animal’, yet it remains feasible that there are animal protection law judgments which do not include the term.Footnote 17 Lastly, UK animal protection law as a whole extends beyond court judgments to ‘hard law in the form of statutes and treaties and soft law such as standards issued by international organisations’ (Peters, 2020, p. 1).

Beyond limitations in the scope of documents on which the model was trained and applied, there were also multiple ways in which the performance of different trialled systems might have been increased. For all USE, s-BERT, Longformer and BigBird systems, text embeddings were created using pre-trained transformer models. It is conceivable that fine-tuning the transformers could have produced superior results (Devlin et al., 2019; Sun et al., 2019). Additionally, the features used to represent each judgment were not derived from any specific portion of the judgment text. Working with only particular sections of the judgment such as the facts of the case might have enhanced prediction accuracy. Lastly, the classifier was trained using just 400 judgments labelled by a single domain expert. Using multiple people to label the same number of judgments might have reduced error (see further, Aslan et al., 2017; Muller et al., 2021). Alternatively, using a greater number of human-labelled judgments for model training would likely have improved the model’s predictions (and increased the proportion of judgments in the repository that were human-classified).

While one or more of these tasks could be considered necessary for the creation of a conclusive repository of judgments, each was beyond the scope of this initial repository creation project. Tuning sentence encoders and extracting sections of the text from judgments on BAILII are both more complex than the work presented here. With regards to the latter, the (inconsistent) structure of BAILII judgments obstructs the division of judgments into distinct sections (cf. Medvedeva et al., 2020). Further, additional labelling would have added time and potentially cost to the project. Each recognised limitation is therefore merely considered a pointer towards potential future research.

Indeed, we remain confident in the ML system on which our judgment repository was partially built. This performed significantly better than most alternate systems and a baseline measure intended to reflect current searching practice (Sect. 3.4). It was also constructed in a manner that permitted investigation into influential features. This investigation suggested many features to make rational contributions to judgment classification (Sect. 3.5.1). Amongst the (rational) negative predictors was the term ‘jury’, which stimulated important discussion of the backward-looking nature of our model. This consideration of influential terms highlighted the benefits of using ML systems that permit some level of human understanding.

5 Conclusion

Using animal protection law as a case study, this paper has shown that ML can be employed to create a worthwhile judgment repository for a new practice area. To achieve this, we outlined a judgment repository creation process that began with the identification of 1637 judgments on BAILII containing ‘animal’ made by the Privy Council, House of Lords, Supreme Court and upper England and Wales courts between January 2000 and December 2020. This figure contrasts with BAILII’s own search tool, which identified only 1568 judgments actually containing the term. Five hundred of the judgments containing ‘animal’ were labelled by a domain expert and used to train and validate ten ML systems. The best-performing system confirmed the merits of using NLP and ML for judgment classification by achieving a macro-F1 score of 0.87 and accuracy of 0.93 on a test set of 100 judgments. This system was used to classify the remaining (unlabelled) judgments, giving a repository of 175 animal protection law judgments of which 92 were found automatically. Preliminary examination of the repository suggests it could aid the identification of individual animal protection law judgments and enhance understanding of the breadth of animal protection law created by courts.