What do the experts know? Calibration, precision, and the wisdom of crowds among forensic handwriting experts
Forensic handwriting examiners currently testify to the origin of questioned handwriting for legal purposes. However, forensic scientists are increasingly being encouraged to assign probabilities to their observations in the form of a likelihood ratio. This study is the first to examine whether handwriting experts are able to estimate the frequency of US handwriting features more accurately than novices. The results indicate that the absolute error for experts was lower than novices, but the size of the effect is modest, and the overall error rate even for experts is large enough as to raise questions about whether their estimates can be sufficiently trustworthy for presentation in courts. When errors are separated into effects caused by miscalibration and those caused by imprecision, we find systematic differences between individuals. Finally, we consider several ways of aggregating predictions from multiple experts, suggesting that quite substantial improvements in expert predictions are possible when a suitable aggregation method is used.
KeywordsJudgment and decision-making Bayesian modeling Expertise Wisdom of crowds
What makes someone an expert? On the one hand, legal scholarship and rules of evidence often cite the importance of knowledge, skill, experience, training, or education in a particular discipline (for example in Rule 702 of the U.S. Federal Rules of Evidence, 2016 and Section 79 of the Australian Evidence Act (Cth), 1995; see Martire and Edmond (2017)). From a scientific perspective, both popular accounts (Gladwell, 2008; Ericsson & Pool, 2016) and scholarly literature on the development of expertise place an emphasis on the critical role of deliberate practice (Ericsson et al., 1993). Nevertheless, genuine expertise is more than mere experience: it should also produce expert performance, characterized as “consistently superior performance on a specified set of representative tasks for a domain” (Ericsson & Lehmann, 1996, p. 277). Yet in many situations “performance” is not at all straightforward to define, as there are very often no agreed upon ground truths or gold standards that can be used as the basis for assessment (Weiss et al., 2006). Indeed, this is the situation for many forensic science disciplines (Taroni et al., 2001).
An alternative approach suggests that expertise can be characterized in terms of the ability to make fine-grained discriminations in a consistent manner, as captured by measures such as the Cochran–Weiss–Shanteau (CWS) index (Weiss & Shanteau, 2003). This approach is appealing insofar as it can be applied even when objective gold standards are not available, but has one substantial drawback: it characterizes expertise in terms of the precision (i.e., discriminability and consistency) of expert performance rather than the accuracy (i.e., correctness).
One might hope that precision and accuracy are related, but in real life there are no guarantees that this is so. Whether one is considering the widespread belief in phrenology in the 19th century (Faigman, 2007), disproportionate trust in unreliable (Saks & Koehler, 2005) or unvalidated forensic methods (President’s Council of Advisors on Science and Technology, 2016), beliefs in conspiracy theories (Goertzel, 1994) or “groupthink” that plagues decision-making processes (Janis, 1982), it is clear that communities of people can come to considerable agreement about incorrect claims. A phrenologist might indeed produce very precise judgments, making fine-grained discriminations about a person’s character in a consistent way based on their physiognomy, but that does not mean the judgments of phrenologists are sufficiently accurate to be considered expert in a scientific sense. From a practical standpoint, therefore, the distinction between the precision of an individual and the accuracy of their judgments is of critical importance.
In this paper, we explore this question in a real-world domain, using forensic handwriting expertise as our testing ground. The domain is one of considerable practical importance: forensic handwriting examinations can be used to establish the origin of a questioned sample of handwriting for legal purposes (Dyer et al., 2006), and these judgments can be accorded considerable weight at trial. The task is heavily reliant on subjective judgment, with human examiners completing these assessments via visual comparison of handwriting samples (Dror & Cole, 2010). Traditionally handwriting and other feature-comparison examiners (e.g., fingerprint) have been permitted to make categorical judgments (‘match’ or ‘no-match’) without providing information about the uncertainty associated with their conclusion. Past research has suggested that forensic handwriting examiners are remarkably accurate in their ability to make these types of authorship determinations (e.g., Sita, Found, & Rogers, 2002; Found & Rogers, 2008; Kam, Gummadidala, Fielding, & Conn, 2001).
However, in the United States, the President’s Council of Advisors on Science and Technology 2016 and the National Academies of Science 2009 have both strongly criticized the traditional approach—arguing that unqualified categorical opinions are scientifically unsupportable. Consistent with these criticisms, many forensic scientists now endorse the use of likelihood ratios—the relative probability of the observations under different hypotheses as to their provenance—as the appropriate method for providing expert testimony (Aitken et al., 2011). As part of adopting this approach, handwriting examiners may be called upon to observe handwriting features and then assign probabilities to feature occurrence if, for example, two handwriting samples originated from the same versus different writers (Dror, 2016). How plausible is it that human handwriting experts are capable of producing genuinely superior performance than novices on this task?
On the one hand, there is cause for optimism: research examining visual statistical learning reveals that people can automatically and unconsciously learn statistical relationships from visual arrays given relatively limited exposure (Fiser & Aslin, 2001, 2002; Turk-Browne, Jungé, & Scholl, 2005). Further, experts have previously been found to have enhanced domain-specific statistical learning in comparison to novices (Schön & François, 2011). Indeed, it has sometimes been argued that the relevant probabilities (e.g., of seeing a particular handwriting feature given that two writing samples originated from the same versus different writers) can be estimated by the examiner based on their subjective experiences (Biedermann et al., 2013). On the other hand, the applied problem is essentially a probability judgment task, and there is considerable evidence that people tend not to be well calibrated at such tasks (e.g., Lichtenstein et al. (1982), but see Murphy and Daan (1984) in contexts where feedback is fast and accurate). Accordingly, there is some uncertainty as to whether forensic examiners will possess the relevant expertise in a fashion that would justify the use of such judgments in a legal context.
Our goal in this paper is to present the first empirical data examining this question, and in doing so, highlight the importance of distinguishing precision from accuracy in the assessment of expert performance. Our approach relies on a recently collected database of handwriting features (Johnson et al., 2016). This database was funded by the US National Institute of Justice (NIJ) to statistically estimate the frequency of handwriting features in a sample representative of the US adult population. We were able to access the estimates before they became publicly available. This allowed us to rely on a measure of “ground truth” that was not yet available to experts in the field. Using this database we were able to compare the performance of experts and novices, as well as people with high exposure to the relevant stimulus domain (US participants) and those whose experience pertains to a potentially different set of environmental probabilities (non-US participants).
One-hundred and fifty participants were recruited from forensic laboratories, mailing lists, and universities via email invitation. Participants not completing the experiment or not providing complete professional practice information in order for their expertise to be determined were excluded (n = 52). Two handwriting specialists who had not produced any expert reports or statements and one who was involved in the NIJ project were also excluded. The final sample comprised eighteen court-practicing handwriting specialists (henceforth ‘experts’; M = 149.9 investigative and court reports from 2010-2014, range = 9–1285; 8 US-based, 10 not) and 77 participants reporting no training, study or experience in handwriting analysis (36 US-based, 41 not). The participants that were not US-based were located in Australia (46.3%), Canada (3.2%), the Netherlands (2.1%), South Africa (1.1%), or Germany (1.1%). A $100 iTunes voucher was offered for the most accurate performance. Materials and data are available at https://osf.io/n2g4v/.
From an applied perspective, the critical question to ask is whether the expert judges are more accurate than novices, and our initial (planned) analyses consider this issue first. To that end, we adopted a Bayesian linear mixed model approach, using the BayesFactor package in R (Morey & Rouder, 2015), with the absolute magnitude of the error on every trial used as the dependent variable, and incorporating a random effect term to capture variability across the 95 participants and another to capture variability across the 60 features.
When analyzed in this fashion, there is strong evidence (Bayes factor 39:1 against the baseline model including only the random effects) that the expert judges were more accurate—average error 21% on any given trial—than the novices, who produced errors of 26% on average. However, the best performing model was the ‘full’ model that considered all four groups (US experts, US novices, non-US experts, non-US novices) separately, with a Bayes factor of 300:1 against the baseline and 3.7:1 against a model that includes both main effects and no interaction. Consistent with this, the data show a clear ordering: the most accurate group were the US experts (20% error), followed by the non-US experts (22% error). The novices were both somewhat worse, but curiously the non-US novices performed better (24% error) than the US novices (28% error).
Estimating individual expert knowledge
While it is reassuring that handwriting experts perform better on the judgment task than novices, the specific pattern of results—in which an interaction between expertise and country is observed—requires some deeper explanation. How does it transpire that experts from the US outperformed experts from outside the US, whereas US novices appeared to perform worse than non-US novices? Do forensic handwriting examiners possess superior knowledge of the underlying probabilities, are they more precise in how they report their knowledge, or both?
Moreover, these plots highlight the manner in which the overall error rate is not always the best guide as to expertise. On the one hand, participant #21 has the lowest error rate due to the fact that they have good calibration and high precision; and conversely, participant #81 has very high error due to their poor calibration and low precision. On the other hand, the imprecise responding of participant #3 leads to a very high error rate despite being very well calibrated. Arguably the responses of participant #3 reflect the ground truth better than participant #67—who always provides responses in the middle of the range—even though the latter has a much lower error.
Extracting wisdom from the forensic crowd
A natural question to ask of our data is whether there is a “wisdom of the crowd” effect (Surowiecki, 2005), in which it might be possible to aggregate the predictions of multiple participants to produce better judgments on the whole. Our sample includes a small number of real-world experts and a much larger number of novices, and as Fig. 5 illustrates, they differ dramatically in calibration and precision. Is it possible to aggregate the responses of these participants in a way that produces more accurate predictions than any individual expert? Does the inclusion of many poorly calibrated and imprecise novices hurt the performance of an expert aggregation model? We turn now to these questions.
One potentially problematic issue with the averaging approach is that it only works when the very poorly performing novices are removed from the data set—in many real-world situations it is not known a priori who is truly expert and who is not. We consider two methods for addressing this: the first is to rely on a robust estimator of central tendency such as the median, instead of using the arithmetic mean. As shown in Fig. 6, aggregation via the median produces better performance regardless of whether US experts (16.4%), all experts (16.1%) or all participants (15.7%) are included. The second method is to adopt a hierarchical Bayesian approach based on Lee and Danileiko (2014), using the more general calibration functions discussed in the previous section. The advantage of this approach is that it automatically estimates the response precision for each person, and learns in an unsupervised way which participants to weight most highly. As illustrated in the far right of Fig. 6, this model makes better predictions than the averaging method or the median response method: using only the US experts the prediction error is 15.1%, which falls slightly to 14.9% when all experts are used, and improves slightly further to 14.7% when the novices are included. In other words, not only is the hierarchical Bayes method robust to the presence of novices, it is able to use their predictions to perform better than it does if only the experts are considered and yields better estimates than simple robust methods such as median response.
The assessment of expert performance is a problem of considerable importance. Besides the obvious theoretical questions pertaining to the nature of expertise (e.g., Weiss & Shanteau, 2003; Ericsson & Lehmann, 1996), a variety of legal, professional and industrial fields are substantially reliant on human expert judgment. For the specific domain of forensic handwriting examination, the legal implications are the most pressing. In this respect, our findings are mixed.
On the one hand, there is some evidence that handwriting experts will be able to estimate the frequency of occurrence for handwriting features better than novices. However, even the single best performing participant produced an average deviation of 18.5% from the true value. On the other hand, this number is considerably lower than would be expected by chance (25%) if people possessed no relevant knowledge and simply responded with .5 on every trial, and using modern Bayesian methods to aggregate the predictions of the experts this can be reduced further to 14.7% error.
In short, these results provide some of the first evidence of naturalistic visual statistical learning in the context of forensic feature comparison. However, we suggest that a cautious approach should be taken before endorsing the use of experience-based likelihood ratios for forensic purposes in the future.
In this figure, the values on x-axis are the ground truth values taken from Johnson et al., (2016). Later in the paper we introduce a version of the model that treats the feature probabilities as unknown parameters: the calibration curves produced by that model are qualitatively the same as the curves plotted here.
As a subtle point, note that these intervals are constructed by treating the set of participants in each group as a fixed effect rather than a random effect, and as such we report the credible interval for the mean calibration difference for these specific participants, rather than attempting to draw inferences about a larger hypothetical population. This is a deliberate choice insofar as we are uncertain what larger population one ought to generalize to in this instance—it should be noted however that the credible intervals reported here necessarily correspond to a modest claim about these specific people rather than a more general claim about experts and novices.
The differences between the US experts and non-US experts are more difficult to characterize: different modeling assumptions produced slightly different answers for this question, beyond the original finding that US experts performed somewhat better overall.
- Budescu, D. V., & Johnson, T. R. (2011). A model-based approach for the analysis of the calibration of probability judgments. Judgment and Decision Making, 6, 857–869.Google Scholar
- Edwards, H., & Gotsonis, C. (2009). Strengthening forensic science in the United States: A path forward. Washington, DC: National Academies Press.Google Scholar
- Ericsson, K. A., & Pool, R. (2016). Peak: Secrets from the new science of expertise. Houghton Mifflin Harcourt.Google Scholar
- Faigman, D. L. (2007). Anecdotal forensics, phrenology, and other abject lessons from the history of science. Hastings Law Journal, 59, 979–1000.Google Scholar
- Gladwell, M. (2008). Outliers: The story of success. UK: Hachette.Google Scholar
- Goertzel, T. (1994). Belief in conspiracy theories. Political Psychology, 731–742.Google Scholar
- Janis, I. L. (1982) Vol. 349. Boston: Houghton Mifflin.Google Scholar
- Johnson, M. E., Vastrick, T. W., Boulanger, M., & Schuetzner, E. (2016). Measuring the frequency occurrence of handwriting and handprinting characteristics. Journal of Forensic Sciences.Google Scholar
- Lee, M. D., & Danileiko, I. (2014). Using cognitive models to combine probability estimates. Judgment and Decision Making, 9(3), 259–273.Google Scholar
- Lichtenstein, S., Fischhoff, B., & Phillips, L. (1982). Calibration of probabilities: The state of the art to 1980. In Kahneman, D., Slovic, P., & Tversky, A. (Eds.) Judgement under uncertainty: Heuristics and biases. New York: Cambridge University Press.Google Scholar
- Martire, K. A., & Edmond, G. (2017). Rethinking expert opinion evidence. Melbourne University Law Review, 40, 967–998.Google Scholar
- Merkle, E. C. (2010). Calibrating subjective probabilities using hierarchical Bayesian models. In International conference on social computing, behavioral modeling, and prediction, (pp. 13–22).Google Scholar
- Morey, R. D., & Rouder, J. N. (2015). BayesFactor: Computation of Bayes factors for common designs [Computer software manual]. Retrieved from https://CRAN.R-project.org/package=BayesFactor (R package version 0.9.12-2).
- Plummer, M. (2003). JAGS: A Program for analysis of Bayesian graphical models using Gibbs sampling. In Proceedings of the 3rd international workshop on distributed statistical computing, (Vol. 124, p. 125).Google Scholar
- President’s Council of Advisors on Science and Technology (2016). Forensic science in criminal courts: Ensuring scientific validity of feature-comparison methods. Washington, DC: Executive Office of the President of the United States.Google Scholar
- Schön, D., & François, C. (2011). Musical expertise and statistical learning of musical and linguistic structures. Frontiers in Psychology, 2(167), 1–9.Google Scholar
- Surowiecki, J. (2005). The wisdom of crowds. Anchor.Google Scholar