Confusion Matrix Annotation
We have developed a practical methodology for refining the quality of the parser, using a form of semantic annotation of AWA’s output by the domain expert (the civil law academic leading the course). Designed to be practical for developing analytics tools with staff who have limited time and resources, this is a rapid version of the more systematic coding that a qualitative data analyst would normally perform on a large corpus. It uses signal detection theory codes for True/False Positives/Negatives to populate a confusion matrix:
|  | Lecturer: significant move present | Lecturer: no significant move |
| AWA highlights the sentence | True Positive | False Positive |
| AWA does not highlight the sentence | False Negative | True Negative |
Thus, the lecturer was asked to mark True Positives and True Negatives where she agreed with AWA’s highlighting and classification of a sentence, or with its absence; False Positives where AWA highlighted a sentence she did not consider significant; and False Negatives where AWA ignored a sentence she considered important. We also placed misclassifications of a sentence in this last class, as another form of failure to spot an important move.
We did not prepare a detailed annotation guide for the lecturer, since we cannot provide very detailed explanations of AWA highlights to students or teachers either. As described above, AWA labels are based on complex patterns that would be far too cumbersome to describe within AWA itself. Our aim is to keep the AWA labels intuitively understandable, which is a challenging aspect of the UI. We therefore defined each rhetorical move informally in one sentence and gave some examples of each type. This first experiment also served as a test of whether the label names, the short explanations and the examples in AWA enable an analyst to grasp the semantics of the labels: we wanted insight into how to improve the guidance in the UI (rather than formally assessing the annotation scheme).
Starting with the generic rhetorical parser, the lecturer selected several essays with known grades. She pasted AWA’s output into Microsoft Word and, using the agreed scheme, annotated it with TP/FP/TN/FN codes plus explanatory margin comments. The linguist in turn commented on this document, for instance explaining why AWA behaved the way it did when this confused the lecturer, or asking why a sentence was annotated as FP/FN.
This structured approach to analyst–academic communication began to build a corpus from which one could in principle calculate metrics such as precision, recall and F1; however, it is not yet large enough to do so reliably. Rather, the confusion matrix provided more focused feedback to the team than informal comments, aiding rapid iteration and dialogue, using a familiar tool (Word) and a simple 2 × 2 representation that required no training. We return to the importance of this process in the discussion of algorithmic accountability.
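For reference, the standard metrics that such a corpus would eventually support are simple functions of the four cell counts. The sketch below (Python, with purely illustrative counts rather than our data) shows the computation:

```python
def metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard signal-detection metrics from the four confusion-matrix cells."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Illustrative counts only -- not taken from our annotation exercise.
print(metrics(tp=12, fp=4, tn=30, fn=6))
```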
Refinements to AWA
For each of the cells in the confusion matrix, we consider examples of successes and failures, and the different adaptation measures taken to improve the signal-to-noise ratio.
Consistent with the intentions of the rhetorical analysis, these sentences illustrate correct classification:
Summing up the main topic of the essay:
We found that sentences annotated as False Positives by the lecturer were falsely triggered by patterns that are often relevant in non-legal academic writing, but that in law are used as legal ‘terms of art’, for instance:
We can reduce False Positives in such cases by deleting the legal terms from the XIP lexicon, but the complication is that these words may also be used in their analytical sense. In such cases we implement disambiguation rules. In the following sentence, “issue” is not used as a legal term, and so the sentence should be (and is) highlighted:
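Abstracting away from the specific sentences, a disambiguation rule of this kind can be sketched as follows. This is a minimal Python illustration, not actual XIP rule syntax, and the context cues are invented for the example: a term of art is kept as an analytical highlight only if nearby context signals its analytical sense.

```python
# Illustrative sketch of a disambiguation rule -- not XIP syntax.
# A term like "issue" is treated as a legal term of art (suppressed)
# unless nearby context suggests its analytical sense.

ANALYTICAL_CUES = {"central", "key", "main", "address", "addresses", "raise"}  # hypothetical cues

def analytical_sense(tokens: list[str], i: int, window: int = 3) -> bool:
    """Return True if the term at position i is used analytically."""
    context = tokens[max(0, i - window): i] + tokens[i + 1: i + 1 + window]
    return any(w.lower() in ANALYTICAL_CUES for w in context)

tokens = "The main issue this essay addresses is retention of records".split()
print(analytical_sense(tokens, tokens.index("issue")))  # True -> keep the highlight
```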
A few false negatives arose because the analytical content in legal essays may convey the constituent concepts with words or expressions that are not part of the existing lexicons. For example, neither “assess” nor “argument” was in the lexicon, and thus the following sentence was not labeled. Once the words are added, the SUMMARY pattern is detected by the parser, and the sentence is labeled.
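The mechanics of such a lexicon extension can be illustrated with a deliberately simplified sketch. The lexicon class names and the trigger condition below are our own assumptions; the real XIP patterns additionally require syntactic relations between the concept words, as discussed below.

```python
# Simplified sketch of lexicon-driven move detection. The class names
# SUMMARY_VERB / SUMMARY_NOUN are hypothetical; real XIP patterns also
# require syntactic relations between the concept words.
LEXICON = {
    "SUMMARY_VERB": {"summarise", "conclude", "evaluate"},
    "SUMMARY_NOUN": {"essay", "discussion", "conclusion"},
}

def summary_candidate(sentence: str) -> bool:
    words = {w.strip(".,").lower() for w in sentence.split()}
    return bool(words & LEXICON["SUMMARY_VERB"]) and bool(words & LEXICON["SUMMARY_NOUN"])

sent = "This essay will assess the argument that ..."
print(summary_candidate(sent))        # False: "assess" and "argument" missing
LEXICON["SUMMARY_VERB"].add("assess")
LEXICON["SUMMARY_NOUN"].add("argument")
print(summary_candidate(sent))        # True: the SUMMARY pattern now fires
```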
While one aspect of adaptation is expanding the lexicon, the overwhelming majority of false negatives were in fact due to sentences that the law academic coded as relevant under her interpretation of the XIP categories, but which do not contain patterns coded in XIP.
For example, the lecturer expected the following sentence to be labeled as ‘C’:
This sentence does indeed convey a contrast. However, it is not labeled, because the contrast is not between two “ideas”, but between one effect of technology (i.e. it “has facilitated the retention of all records for businesses”) and Keane’s claim of a “converse effect” of technology. Technically speaking, even though this sentence contains words representing the relevant analytical concepts, it is not selected, since there is no syntactic relationship between any two of them. We can consider that the sentence partially fulfils the criteria for selection, since it contains words instantiating some of the constituent concepts.
Were the sentence formulated in more precise terms, i.e. as a contrast between “ideas”, it would be highlighted and labeled as ‘Contrast’, thus:
In this case we need to consider extending the current analysis, because it seems that the AWA patterns are too restrictive for the ‘C’ move.
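To make the selection criterion concrete, the toy sketch below encodes the two-part test described above, under our own simplifications: contrast vocabulary must be present, and two idea-denoting words must stand in a direct syntactic relation. The word lists and hand-coded dependencies are illustrative only.

```python
# Toy sketch (our own simplification) of the two-part 'Contrast' test:
# contrast vocabulary must be present, AND two idea-denoting words must
# stand in a direct syntactic relation. Dependencies are hand-coded here.
CONTRAST_MARKERS = {"however", "converse", "whereas"}
IDEA_NOUNS = {"idea", "view", "argument", "claim"}

def contrast_move(words: set[str], deps: set[tuple[str, str]]) -> bool:
    has_marker = bool(words & CONTRAST_MARKERS)
    linked_ideas = any(h in IDEA_NOUNS and d in IDEA_NOUNS for h, d in deps)
    return has_marker and linked_ideas

# The problem sentence: contrast vocabulary is present, but no dependency
# links two "idea" words, so the pattern does not select it.
print(contrast_move({"converse", "effect", "technology"},
                    {("facilitated", "retention")}))             # False
# A reformulation that syntactically links two idea words is selected.
print(contrast_move({"however", "idea", "view"},
                    {("contrasts", "idea"), ("idea", "view")}))  # True
```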
The lecturer expected the following sentence to be labeled as ‘B’ (Background knowledge):
This general description of the concept of “discovery” can legitimately be interpreted as “background knowledge”; however, it does not have the same semantics as ‘B’ in AWA. The semantics of the ‘B’ label in AWA is “reference to previous work”, as illustrated in the true positive ‘B’ sentence:
The role of the sentences annotated as false negatives in legal essay analytics needs to be elucidated before further adaptation is undertaken. On the one hand, we need to revise the UI definitions and explanations so that they are in line with users’ intuitions; on the other, we need to consider modifying the discourse patterns to be detected so that they target legal discourse more specifically.
Taken together, the existing general analytical parser, without any adaptation, did provide relevant output for legal texts. Our data are too small to compute meaningful metrics, so in Table 2 we report the results of the annotation exercise simply as numbers of sentences.
This test indicated that lexical adaptation is required: deleting legal ‘terms of art’ from the general lexicon, and extending the lexicon with the genre-specific vocabulary used in legal essays to convey rhetorical moves. No False Negatives or False Positives were caused by syntactic parse errors: even though some sentences in the legal essays are longer than those in typical general texts, this did not affect parser performance.
We started the lexical adaptation based on the test annotations, creating shared documents in which we collected words to be added and deleted as we encountered them during development. Table 3 illustrates the list of legal ‘terms of art’ collected for deletion.
Currently, changes to XIP (such as those introduced above) are implemented by hand. We foresee mechanisms that automatically update the lexicons from user input, or that learn the domain vocabulary through machine learning.
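One possible shape of such a mechanism, sketched here purely as a speculative illustration: false positive reports vote to delete a term, false negative reports vote to add one, and a change is applied only once enough independent reports agree. The class and the threshold are our own invention, not an existing AWA or XIP facility.

```python
from collections import Counter

# Hypothetical sketch of semi-automatic lexicon maintenance: terms flagged
# in FP annotations are proposed for deletion, terms from FN annotations
# for addition, and a change is applied only at an agreement threshold.
class LexiconUpdater:
    def __init__(self, lexicon: set[str], threshold: int = 3):
        self.lexicon = lexicon
        self.threshold = threshold
        self.delete_votes = Counter()
        self.add_votes = Counter()

    def report_false_positive(self, term: str) -> None:
        self.delete_votes[term] += 1
        if self.delete_votes[term] >= self.threshold:
            self.lexicon.discard(term)

    def report_false_negative(self, term: str) -> None:
        self.add_votes[term] += 1
        if self.add_votes[term] >= self.threshold:
            self.lexicon.add(term)
```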
No formal evaluation of the effect of the changes has been performed, but it is instructive to analyse the output of the slightly adapted parser on the document used for the annotation exercise. After this basic lexicon update, the confusion matrix showed the results indicated in Table 4.
These changes resulted in a decrease in the number of False Positives, with a commensurate increase in the number of True Negatives, due to the deletion of the legal terms from the general analytical lexicon. For example, the following sentence was highlighted as ‘Contrast’ by the general analytical parser, but not by the adapted legal parser, because issue, solution and problem had been eliminated from the lexicon.
The remaining False Positives and False Negatives are due to differences between the annotator’s and the general analytical parser’s definitions of the rhetorical moves. Further analysis is needed to determine whether the scope of the analytical parser should be changed by adapting the patterns to legal rhetorical moves.
Having taken the first steps in refining AWA for the legal domain, we moved into a first iteration of exposure to the students themselves.