The methodology used for machine annotation classified the annotation units with a Naive Bayes Multinomial classifier based on the set of selected features described below.
Features for classification
The task of feature selection focused on identifying features that can help in classifying sentences. The following features were selected for extraction from the dataset:
Unigrams are widely used in text classification tasks. The performance of classifiers relying on the bag-of-words approach can, however, be impeded by the assumption that words are independent, i.e., that grammatical relations are not significant. To address this limitation, researchers often complement unigrams with features that can capture dependencies between words. Dependency pairs derived using the Stanford Parser (Marneffe et al. 2006) were used to complement unigrams, creating word pairs that are grammatically linked rather than simply collocated like n-grams. Dependency features have previously been shown to be difficult to beat for a variety of text classification tasks such as sentiment analysis (Joshi and Penstein-Rosé 2009) and stance classification (Hasan and Ng 2014; Mandya et al. 2016).
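To make the form of these features concrete, the short sketch below pairs each word with its grammatical head. It uses spaCy purely for illustration (an assumption made for exposition; the study itself derived the dependency pairs with the Stanford Parser).

# Illustration only: the study derived dependency pairs with the Stanford Parser
# (Marneffe et al. 2006); spaCy is substituted here to show how head-dependent
# word pairs differ from simple n-gram collocations.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def dependency_pairs(sentence):
    """Return (head, dependent) word pairs for a sentence."""
    doc = nlp(sentence)
    return [(tok.head.text.lower(), tok.text.lower())
            for tok in doc if tok.dep_ != "ROOT"]

print(dependency_pairs("The contract contained a general condition."))
# e.g. [('contract', 'the'), ('contained', 'contract'), ('condition', 'a'), ...]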
Part of speech tags were selected as a feature for a number of reasons. Firstly, it was expected that modal verbs and verb tense may help to classify the annotation units. Sentences that introduce facts are most often presented in the past tense. For example:
The contract contained a general condition that in relation to any financial or other conditions either party could at any time before the condition was fulfilled or waived avoid the contract by giving notice.
Secondly, both epistemic and deontic modal qualifiers that use modal verbs are common in sentences containing legal principles, for example:
“It is a question which must depend on the circumstances of each case, and mainly on two circumstances, as indicating the intention, viz., the degree of annexation and the object of the annexation” (Cardigan v Moore & Anor, 2012).
“As a matter of principle no order should be made in civil or family proceedings without notice to the other side unless there is a very good reason for departing from the general rule that notice must be given” (Gorbunova v Berezovsky (aka Platon Elenin) & Ors, 2013).
In addition, we used three other features that captured the length of the sentence (number of words), its position in the text (on a scale of 0–1) and whether or not there is a citation in the sentence (boolean).
We used NLTK (Bird 2006) to extract part of speech tags and Stanford CoreNLP (Manning et al. 2014) to extract grammatical relations or dependencies. The other features were derived by means of a Python script.
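As a minimal sketch of what such a script might look like, the fragment below computes the part of speech tags together with the length, position and citation features; the citation regex is a simplified, hypothetical heuristic rather than the exact rule used in this work.

# Minimal sketch of the surface features; assumes NLTK's 'punkt' and
# 'averaged_perceptron_tagger' data packages are available. The citation
# pattern is a simplified, hypothetical heuristic.
import re
import nltk

CITATION_RE = re.compile(r"\[\d{4}\]\s+\w+|\bv\.?\s+[A-Z]")  # e.g. "[2006] EWHC", "v Moore"

def sentence_features(sentence, index, total_sentences):
    tokens = nltk.word_tokenize(sentence)
    pos_tags = [tag for _, tag in nltk.pos_tag(tokens)]       # e.g. ['DT', 'NN', 'VBD', ...]
    return {
        "pos_tags": pos_tags,
        "length": len(tokens),                                 # number of words
        "position": index / max(total_sentences - 1, 1),       # scaled to 0-1
        "has_citation": bool(CITATION_RE.search(sentence)),    # boolean citation flag
    }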
Machine learning framework
Our machine learning experiments were conducted using Weka (Hall et al. 2009), a collection of machine learning algorithms for data mining tasks. Given the limited amount of labeled data available, no development stage was employed. We instead used linguistic features that we expected to be useful, relied on automatic feature selection to prune the feature set, ran a single machine learning algorithm with default settings and reported results using a cross-validation methodology, as detailed below (an illustrative sketch of the pipeline follows the list):
1. Feature counts were normalised by tf and idf.
2. Attribute selection (InfoGainAttributeEval in combination with the Ranker (threshold = 0) search method) was performed over the entire dataset.
3. The Naive Bayes Multinomial classifier was used for the classification task. This has been widely used in text classification tasks (Teufel et al. 2006; Mitchell 1997), and its performance is often comparable to more sophisticated learning methods (Schneider 2005).
4. Results are reported for tenfold cross-validation. The 2659 sentences in the dataset were randomly partitioned into 10 subsamples. In each fold, one of the subsamples was used for testing after training on the remaining 9 subsamples. Results are reported over the 10 testing subsamples, which together constitute the entire dataset.
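For readers who prefer code to prose, a rough scikit-learn analogue of steps 1-4 is sketched below. This is an approximation of the Weka setup rather than the actual experimental code: mutual information stands in for InfoGainAttributeEval, only unigram counts are vectorised, and selection here happens inside each cross-validation fold rather than over the entire dataset.

# Approximate scikit-learn analogue of the Weka pipeline described above.
# Not the code used in the study: mutual information replaces InfoGainAttributeEval,
# and feature selection is applied within each fold rather than over the whole dataset.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),                         # step 1: tf-idf normalised counts
    ("select", SelectKBest(mutual_info_classif, k=887)),  # step 2: keep the top-ranked features
    ("nbm", MultinomialNB()),                             # step 3: Naive Bayes Multinomial
])

# step 4: tenfold cross-validation; `sentences` and `labels` stand for the
# 2659 annotated sentences and their gold-standard categories.
# scores = cross_val_score(pipeline, sentences, labels, cv=10, scoring="accuracy")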
Results
Tables 2 and 3 report the classification performance of the Naive Bayes Multinomial classifier from the Weka toolkit (Hall et al. 2009). Feature selection reduced the number of features from 51576 to 887; we report more on the selected features later in this section.
The accuracy of the classifier is slightly better than that of Annotator 2 (as reported in Sect. 4.2), who had no legal training in the manual study. The classifier achieves high precision and recall for each of the three categories, despite the unbalanced nature of the corpus (60% neutral, 30% principles and 10% facts). The confusion matrix in Table 3 shows that facts and principles are easily distinguished from each other. The majority of classification errors involve confusion with the neither category. These results suggest that, to the extent such annotations can be carried out on the basis of linguistic principles alone, automated annotation can be performed to the same standard as manual annotation.
Table 2 Per category and aggregated statistics for automatic classifier
Table 4 shows the top 100 features for this classification task. These are mostly part-of-speech tags; unigrams such as ‘is’, ‘be’, ‘was’, ‘must’, ‘may’, ‘will’ and ‘can’ that indicate tense or modality; unigrams such as ‘a’, ‘an’ and ‘the’ that indicate definiteness; unigrams such as ‘if’, ‘whether’ and ‘where’ that can be used in stating conditions; and unigram and dependency pair features involving generic legal words such as ‘principle’, ‘judgment’, ‘concerned’, ‘party’, ‘court’, ‘jurisdiction’, ‘case’, ‘condition’ etc. Only a small number of the features represent noise from overfitting the training data, including proper names such as ‘Charles’ and ‘Akso’ and numbers such as ‘0’, ‘100’ and ‘300’. Using only these 100 features already achieves a reasonably high accuracy of 0.72 (\(\kappa =0.48\)), comfortably outperforming the majority class baseline (accuracy = 0.60, \(\kappa =0.00\)).
Table 4 Top 100 features by information gain
In the learnt Bayesian model, part of speech features such as ‘MD’ (modals), ‘VBZ’ and ‘VB’ (present tense) and ‘WRB’ (WH-adverbs), as well as unigrams such as ‘a’ and ‘an’ (indefinites), ‘is’, ‘be’ and ‘has’ (present tense), ‘may’, ‘must’, ‘will’ and ‘can’ (modals), and ‘if’, ‘whether’, ‘or’, ‘unless’ and ‘where’ (conjunctions), have higher probability for the principles class, as do general purpose nouns such as ‘person’ and ‘party’. This is intuitive, as principles are often stated in the present tense, scoped with a modal verb, presented in general terms (thus using indefinites or general nouns such as ‘person’ or ‘party’) and typically relate more than one clause using conjunctions. On the other hand, the main features that predict facts are the use of the past tense (‘VBD’, ‘VBN’, ‘was’, ‘were’, ‘had’, ‘concerned’, etc.) and proper names (‘NNP’, ‘Mr’, ‘Mrs’, etc.). The principal features that select for the neither class are the use of the first person (‘I’, ‘my’ and ‘judgment-my’), other references indicating that the sentence is about the current case rather than a cited one (‘this’, ‘case-this’, etc.), words indicating that the judge is summarising or quoting (‘says’, ‘says-he’, ‘summarised’), and the use of adjectives and non-WH adverbs (‘JJ’ and ‘RB’), which might indicate an opinion being expressed.
Finally, we evaluated the performance of the classifier on each type of feature (part of speech, unigram and dependency) separately, as reported in Table 5, along with the majority class baseline that labels all sentences as neutral. While using only part of speech tags achieves 63% accuracy, dependency features by themselves achieve 81% accuracy. The combination of all three feature types performs best (85% accuracy). All feature sets outperform the baseline.
Table 5 Performance of each type of feature
Error analysis
A simple visual inspection of the confusion output was performed and some hypotheses regarding the causes of confusion were formed. The gold standard corpus annotates a variety of sentences containing facts, including sentences whose main purpose is something other than introducing facts. In real-life scenarios, courts do not always provide a detailed description of the facts, but instead embed facts within legal reasoning. For this reason, sentences that contain information about facts follow a wide variety of grammatical patterns. In combination with the relatively small number of available instances, this variety is likely to have had a negative impact on the classification outcomes. For example, the machine annotator failed to identify the following statement as containing a fact:
“In Antec International Limited v Biosafety USA Inc [2006] EWHC 47 Mrs Justice Gloster was dealing with an application to set aside an order giving leave to serve abroad in a contractual claim where the contract contained a non-exclusive jurisdiction clause.” (Abela & Ors v Baadarani & Anor, 2011)
Visual inspection of instances suggests that confusion between fact and principle, though rare overall, may be typical of sentences whose aim is not to introduce facts and in which factual information is used as part of the reasoning. Such sentences often contain only a short clause of factual information, so that in a small dataset the statistical weights associated with the rest of the sentence may outweigh those associated with that clause. For example:
“The fact that the parties have freely negotiated a contract providing for the non-exclusive jurisdiction of the English courts and English law creates a strong prima facie case that the English jurisdiction is the correct one” (Abela & Ors v Baadarani & Anor, 2011).
The main cause of error in the automatic annotation of principles was that the gold standard annotated only principles from cited cases, yet these were often linguistically indistinguishable (for our machine learning approach) from discussions of principles by the current judge; i.e., principles expressed by the current judge should have been annotated as neither, but were frequently annotated as principles.
Better results might be achieved if the annotation guidelines were redefined to be more specific about what constitutes a fact or a principle; for instance, the fact class could be limited to sentences whose primary aim is to introduce facts. Introducing further features to determine the provenance of principles could help with the confusion between principles and neutral.