Introduction

What makes someone an expert? On the one hand, legal scholarship and rules of evidence often cite the importance of knowledge, skill, experience, training, or education in a particular discipline (for example, Rule 702 of the U.S. Federal Rules of Evidence, 2016, and Section 79 of the Australian Evidence Act (Cth), 1995; see Martire and Edmond (2017)). From a scientific perspective, both popular accounts (Gladwell, 2008; Ericsson & Pool, 2016) and the scholarly literature on the development of expertise emphasize the critical role of deliberate practice (Ericsson et al., 1993). Nevertheless, genuine expertise is more than mere experience: it should also produce expert performance, characterized as “consistently superior performance on a specified set of representative tasks for a domain” (Ericsson & Lehmann, 1996, p. 277). Yet in many situations “performance” is not at all straightforward to define, because there are very often no agreed-upon ground truths or gold standards that can serve as the basis for assessment (Weiss et al., 2006). Indeed, this is the situation for many forensic science disciplines (Taroni et al., 2001).

An alternative approach suggests that expertise can be characterized in terms of the ability to make fine-grained discriminations in a consistent manner, as captured by measures such as the Cochran–Weiss–Shanteau (CWS) index (Weiss & Shanteau, 2003). This approach is appealing insofar as it can be applied even when objective gold standards are not available, but it has one substantial drawback: it characterizes expertise in terms of the precision (i.e., discriminability and consistency) of expert performance rather than its accuracy (i.e., correctness).

One might hope that precision and accuracy are related, but in real life there are no guarantees that this is so. Whether one considers the widespread belief in phrenology in the 19th century (Faigman, 2007), disproportionate trust in unreliable (Saks & Koehler, 2005) or unvalidated forensic methods (President’s Council of Advisors on Science and Technology, 2016), beliefs in conspiracy theories (Goertzel, 1994), or the “groupthink” that plagues decision-making processes (Janis, 1982), it is clear that communities of people can come to considerable agreement about incorrect claims. A phrenologist might indeed produce very precise judgments, making fine-grained discriminations about a person’s character in a consistent way based on their physiognomy, but that does not mean the judgments of phrenologists are sufficiently accurate to be considered expert in a scientific sense. From a practical standpoint, therefore, the distinction between the precision of an individual and the accuracy of their judgments is of critical importance.

In this paper, we explore this issue in a real-world domain, using forensic handwriting expertise as our testing ground. The domain is one of considerable practical importance: forensic handwriting examinations can be used to establish the origin of a questioned sample of handwriting for legal purposes (Dyer et al., 2006), and these judgments can be accorded considerable weight at trial. The task is heavily reliant on subjective judgment, with human examiners completing these assessments via visual comparison of handwriting samples (Dror & Cole, 2010). Traditionally, handwriting and other feature-comparison examiners (e.g., fingerprint examiners) have been permitted to make categorical judgments (‘match’ or ‘no match’) without providing information about the uncertainty associated with their conclusions. Past research has suggested that forensic handwriting examiners are remarkably accurate in making these types of authorship determinations (e.g., Sita, Found, & Rogers, 2002; Found & Rogers, 2008; Kam, Gummadidala, Fielding, & Conn, 2001).

However, in the United States, the President’s Council of Advisors on Science and Technology (2016) and the National Academies of Science (2009) have both strongly criticized the traditional approach, arguing that unqualified categorical opinions are scientifically unsupportable. Consistent with these criticisms, many forensic scientists now endorse the use of likelihood ratios (the relative probability of the observations under different hypotheses as to their provenance) as the appropriate method for providing expert testimony (Aitken et al., 2011). Under this approach, handwriting examiners may be called upon to observe handwriting features and then assign probabilities to the occurrence of those features under competing hypotheses, for example, that two handwriting samples originated from the same writer versus different writers (Dror, 2016). How plausible is it that human handwriting experts can genuinely outperform novices on this task?

On the one hand, there is cause for optimism: research on visual statistical learning shows that people can automatically and unconsciously learn statistical relationships from visual arrays given relatively limited exposure (Fiser & Aslin, 2001, 2002; Turk-Browne, Jungé, & Scholl, 2005). Further, experts have previously been found to show enhanced domain-specific statistical learning in comparison to novices (Schön & François, 2011). Indeed, it has sometimes been argued that the relevant probabilities (e.g., of seeing a particular handwriting feature given that two writing samples originated from the same versus different writers) can be estimated by the examiner on the basis of their subjective experience (Biedermann et al., 2013). On the other hand, the applied problem is essentially a probability judgment task, and there is considerable evidence that people tend not to be well calibrated at such tasks (e.g., Lichtenstein et al., 1982; but see Murphy & Daan, 1984, for contexts where feedback is fast and accurate). Accordingly, there is some uncertainty as to whether forensic examiners possess the relevant expertise to a degree that would justify the use of such judgments in a legal context.

Our goal in this paper is to present the first empirical data examining this question and, in doing so, to highlight the importance of distinguishing precision from accuracy in the assessment of expert performance. Our approach relies on a recently collected database of handwriting features (Johnson et al., 2016), compiled with funding from the US National Institute of Justice (NIJ) to statistically estimate the frequency of handwriting features in a sample representative of the US adult population. We obtained access to the estimates before they became publicly available, which allowed us to rely on a measure of “ground truth” that was not yet available to experts in the field. Using this database we were able to compare the performance of experts and novices, as well as people with high exposure to the relevant stimulus domain (US participants) and those whose experience pertains to a potentially different set of environmental probabilities (non-US participants).

Methods

Participants

One hundred and fifty participants were recruited from forensic laboratories, mailing lists, and universities via email invitation. Participants who did not complete the experiment, or who did not provide enough information about their professional practice for their expertise to be determined, were excluded (n = 52). Two handwriting specialists who had not produced any expert reports or statements, and one who was involved in the NIJ project, were also excluded. The final sample comprised 18 court-practicing handwriting specialists (henceforth ‘experts’; M = 149.9 investigative and court reports from 2010–2014, range = 9–1285; 8 US-based, 10 not) and 77 participants reporting no training, study, or experience in handwriting analysis (henceforth ‘novices’; 36 US-based, 41 not). Participants who were not US-based were located in Australia (46.3%), Canada (3.2%), the Netherlands (2.1%), South Africa (1.1%), or Germany (1.1%). A $100 iTunes voucher was offered for the most accurate performance. Materials and data are available at https://osf.io/n2g4v/.

Stimuli

Participants were presented with exemplars of handwriting features selected from the NIJ database (Johnson et al., 2016). Sixty feature exemplars (30 cursive and 30 printed handwritten forms) were selected, 12 with probabilities closest to each of five frequencies of occurrence: 1%, 25%, 50%, 75%, and 99%. The range of probabilities within each category was determined by the available exemplars (Fig. 1).

Fig. 1

Stimuli and results. On each trial, participants were presented with an image and verbal description of a handwriting feature exemplar. They were asked to estimate the percentage of the US population of adult writers who have the feature in their handwriting. The handwriting features varied according to case (upper or lower), style (cursive or printed), and occurrence category (from 1% in the top row, through 25%, 50%, and 75%, to 99% in the bottom row). The range of occurrence probabilities in each category is indicated on the left axis of the figure. The graphs on the right show experts’ and novices’ mean absolute error for each occurrence category by country (US or non-US). Error bars represent 95% within-cell confidence intervals

Procedure

Recruitment was necessarily time-limited, being completed during the two weeks prior to the public release of the NIJ estimates. The task itself was straightforward: on every trial, participants were presented with a feature and asked “what percentage of the US population of adult writers have this feature in their handwriting?”, responding with a number from 0 (never present) to 100 (present for all). On each trial, participants were shown images of handwritten letters and directed to the feature of interest by text descriptions below the images (see Fig. 2). After completing all 60 trials, participants provided demographic and professional information.

Fig. 2

Screenshot illustrating the task presented to participants

Results

The absolute error for each trial was computed as the absolute difference between the participant’s estimate and the midpoint of the NIJ estimate range. The results, averaged within participant, are shown in Fig. 3. To illustrate how these results depend on the handwriting feature itself, Fig. 1 depicts the average error for the experts and novices, broken down by country of origin (US and non-US) and by rarity of feature (averaged within occurrence category).
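To make this computation concrete, the following sketch shows how the trial-level errors and per-participant averages could be computed in R. The data frame `dat` and its column names (`participant`, `estimate`, `nij_mid`) are hypothetical placeholders for illustration, not the objects used in our analysis scripts.

```r
# Illustrative only: `dat`, `estimate`, and `nij_mid` are assumed column names.
# `estimate` and `nij_mid` are both on the 0-100 scale.
dat$abs_error <- abs(dat$estimate - dat$nij_mid)   # trial-level absolute error

# Average absolute error per participant (the quantity plotted in Fig. 3)
participant_error <- aggregate(abs_error ~ participant, data = dat, FUN = mean)
```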

Fig. 3

Histogram showing the overall accuracy of every participant, defined as the average absolute error across the 60 features. Each marker represents a single participant, with the color indicating expertise level (black = expert, white = novice) and shape indicating country (circle = US, triangle = non-US)

Overall accuracy

From an applied perspective, the critical question to ask is whether the expert judges are more accurate than novices, and our initial (planned) analyses consider this issue first. To that end, we adopted a Bayesian linear mixed model approach, using the BayesFactor package in R (Morey & Rouder, 2015), with the absolute magnitude of the error on every trial used as the dependent variable, and incorporating a random effect term to capture variability across the 95 participants and another to capture variability across the 60 features.

When analyzed in this fashion, there is strong evidence (a Bayes factor of 39:1 over the baseline model including only the random effects) that the expert judges were more accurate, with an average error of 21% on any given trial, than the novices, who produced errors of 26% on average. However, the best-performing model was the ‘full’ model that treated all four groups (US experts, US novices, non-US experts, non-US novices) separately, with a Bayes factor of 300:1 over the baseline and 3.7:1 over a model that includes both main effects but no interaction. Consistent with this, the data show a clear ordering: the most accurate group was the US experts (20% error), followed by the non-US experts (22% error). Both novice groups were somewhat worse, but curiously the non-US novices (24% error) outperformed the US novices (28% error).
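For readers wishing to reproduce this style of analysis, a minimal sketch using the BayesFactor package is given below. It assumes a trial-level data frame `dat` (hypothetical name) with factors `expertise`, `country`, `participant`, and `feature`; the model structure mirrors the description above, but the exact specifications and priors behind the reported Bayes factors are those of our analysis, not this sketch.

```r
# A sketch of the Bayesian linear mixed models, assuming `dat` contains one
# row per trial with abs_error as the dependent variable.
library(BayesFactor)

rand <- c("participant", "feature")   # random-effect terms
bf_base <- lmBF(abs_error ~ participant + feature, data = dat, whichRandom = rand)
bf_main <- lmBF(abs_error ~ expertise + country + participant + feature,
                data = dat, whichRandom = rand)
bf_full <- lmBF(abs_error ~ expertise * country + participant + feature,
                data = dat, whichRandom = rand)

bf_full / bf_base   # full model versus the random-effects-only baseline
bf_full / bf_main   # full model versus main effects without the interaction
```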

Estimating individual expert knowledge

While it is reassuring that handwriting experts perform better on the judgment task than novices, the specific pattern of results—in which an interaction between expertise and country is observed—requires some deeper explanation. How does it transpire that experts from the US outperformed experts from outside the US, whereas US novices appeared to perform worse than non-US novices? Do forensic handwriting examiners possess superior knowledge of the underlying probabilities, are they more precise in how they report their knowledge, or both?

To investigate these questions, we adopt a hierarchical Bayesian approach, inspired by recent work on expert aggregation models (e.g., Merkle, 2010; Lee & Danileiko, 2014), that seeks to distinguish between different kinds of errors in individual responses. As is often noted in statistics (e.g., Pearson, 1902; Cochran, 1968), the error associated with any measurement can be decomposed into different sources. For instance, researchers interested in probability judgment often estimate a “calibration curve” (e.g., Budescu & Johnson, 2011) that captures the tendency for judgments to reflect systematic bias, as opposed to the idiosyncratic variance associated with imprecise judgments. If \(x_{k}\) denotes the true probability of handwriting feature k, we adopt a two-parameter version of the calibration function used by Lee and Danileiko (2014):

$$\begin{array}{@{}rcl@{}} \psi_{ik} &=& \delta_{i} \log \frac{x_{k}}{1-x_{k}} + \log\frac{c_{i}}{1-c_{i}}\\ \mu_{ik} &=& \frac{\exp(\psi_{ik})}{1+\exp(\psi_{ik})} \end{array} $$

In this expression, \(\mu_{ik}\) denotes the subjective probability that person i assigns to feature k, and it depends on two parameters: the calibration \(\delta_{i}\) describes the extent to which all subjective probabilities for the i-th person are biased towards some criterion value, specified by the parameter \(c_{i}\). This calibration curve describes the systematic error associated with the i-th participant. However, to capture the idea that responses \(y_{ik}\) also reflect unsystematic noise, we assume that these responses are drawn from a normal distribution with mean \(\mu_{ik}\) and standard deviation \(\sigma_{i} = 1/\sqrt{\tau_{i}}\), which is inversely related to the precision \(\tau_{i}\) of the i-th participant, truncated to lie in the range [0,1]:

$$y_{ik} \sim \text{TruncNorm}(\mu_{ik},{\sigma_{i}^{2}},0,1) $$

For any given participant, the key quantities are therefore the parameters that describe their calibration function (\(\delta_{i}\), \(c_{i}\)) and the precision parameter \(\tau_{i}\) that characterizes the amount of noise in their responses. The model was implemented in JAGS (Plummer, 2003) and incorporated hierarchical priors over \(\delta\), \(c\), and \(\tau\); see the Appendix for details.
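A minimal rjags sketch of this model is shown below, assuming responses `y` rescaled to the unit interval (a participants-by-features matrix) and a vector `x` of ground-truth probabilities. The hierarchical priors shown here are illustrative placeholders rather than the priors specified in the Appendix.

```r
# Sketch of the two-parameter calibration model in JAGS. Priors are
# illustrative; the Appendix gives the priors actually used.
library(rjags)

calibration_model <- "
model {
  for (i in 1:N) {
    delta[i]   ~ dnorm(mu.delta, prec.delta)     # calibration parameter
    logit.c[i] ~ dnorm(mu.c, prec.c)             # criterion on the logit scale
    c[i]      <- ilogit(logit.c[i])              # criterion on the probability scale
    tau[i]     ~ dgamma(a.tau, b.tau)            # response precision (1 / sigma^2)
    for (k in 1:K) {
      # psi_ik = delta_i * logit(x_k) + logit(c_i); mu_ik = inverse-logit(psi_ik)
      logit(mu[i, k]) <- delta[i] * logit(x[k]) + logit.c[i]
      # responses are truncated-normal around the calibrated probability
      y[i, k] ~ dnorm(mu[i, k], tau[i]) T(0, 1)
    }
  }
  # illustrative hierarchical priors
  mu.delta   ~ dnorm(1, 0.1)
  prec.delta ~ dgamma(1, 1)
  mu.c       ~ dnorm(0, 0.1)
  prec.c     ~ dgamma(1, 1)
  a.tau      ~ dgamma(1, 1)
  b.tau      ~ dgamma(1, 1)
}
"

# Compile the model and draw posterior samples for the person-level parameters
jags_fit <- jags.model(textConnection(calibration_model),
                       data = list(y = y, x = x, N = nrow(y), K = ncol(y)),
                       n.chains = 4)
samples <- coda.samples(jags_fit, c("delta", "c", "tau"), n.iter = 5000)
```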

The critical characteristic of this model is that it allows us to distinguish systematic miscalibration from imprecise responding. To illustrate the importance of this distinction empirically, Fig. 4 displays the raw responses, average absolute error, estimated calibration curves and degree of response precision for six participants. As is clear from inspection, both sources of error matter: participants #3 and #21 are both extremely well calibrated, but they differ considerably in their precision. Participant #3 does not make fine-grained distinctions in their responding, and as a consequence they sometimes make very large errors (e.g., rating a feature that has probability .01 as having probability 1), whereas the responses from participant #21 tend to cluster much more tightly around the true value. Similarly, while the responses of participants #2 and #67 are almost as precise as those given by participant #21, neither one is particularly well calibrated and their responses are not strongly related to the true probabilities.

Fig. 4

Illustration of how responses vary across individuals. Each panel represents data from a single person, with their overall error rate listed in the top left. Dots depict the response given by each person (y-axis) as a function of the ground truth probability for each item (x-axis). Solid lines depict the estimated calibration function that relates true probability to subjective probability for each person (dashed lines show perfect calibration), and the shaded areas depict 75% prediction intervals for each person. Some participants display good calibration (#21, #3), illustrated by the fact that the solid line lies close to the dashed line, while others do not (#2, #67, #13, #81). Some participants are relatively precise (#2, #21, #67) and have narrower prediction intervals than their less precise counterparts (#3, #13, #81). Finally, miscalibration can take different forms, with some participants responding in a way that is consistently too high (#2, #13) or too low (#81), and others responding in the middle for almost all items (#67)

Moreover, these plots highlight that the overall error rate is not always the best guide to expertise. On the one hand, participant #21 has the lowest error rate because they have both good calibration and high precision; conversely, participant #81 has very high error due to their poor calibration and low precision. On the other hand, the imprecise responding of participant #3 leads to a very high error rate despite their being very well calibrated. Arguably, the responses of participant #3 reflect the ground truth better than those of participant #67, who always provides responses in the middle of the range, even though the latter has a much lower error.

To illustrate this point more generally, Fig. 5 plots the estimated calibration curve for all 94 participants, grouped by expertise and geographical location, with the shading of each curve representing the estimated precision of that participant (see Footnote 1). Overall, it is clear that the expert respondents are better calibrated than the novices, with the curves on the left of the figure tending to sit closer to the dotted line than those on the right. Formally, we assign each person a calibration score corresponding to the sum squared deviation (across all 60 items) between the estimated subjective probability \(\mu_{ik}\) and the true value \(x_{k}\), and construct 95% credible intervals for the average difference in calibration scores between the members of different groups (see Footnote 2). The expert respondents were indeed better calibrated than the novices, for both the US-based participants (CI95 = [-5.4, -3.8]) and the non-US participants (CI95 = [-1.5, -0.007]), though for the non-US participants the effect is modest. Similarly, we observe a difference in precision between experts and novices. Using an analogous precision score, computed as the sum squared deviation between the estimated subjective probability \(\mu_{ik}\) and the response \(y_{ik}\), we find that experts were more precise than novices for both the US-based (CI95 = [-2.3, -1.2]) and non-US-based respondents (CI95 = [-2.0, -1.0]) (see Footnote 3).
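As an illustration of how these scores could be computed from the model output, the sketch below assumes `mu_hat` is a participants-by-features matrix of posterior means for \(\mu_{ik}\), `y` the corresponding response matrix, `x` the vector of true probabilities, and `us_expert`/`us_novice` hypothetical group indices; in the reported analysis the scores and credible intervals are computed over posterior samples rather than posterior means.

```r
# Sketch only: `mu_hat`, `y`, `x`, and the group index vectors are assumed
# objects, and posterior means stand in for full posterior samples.
calibration_score <- rowSums(sweep(mu_hat, 2, x)^2)  # deviation of mu_ik from truth
precision_score   <- rowSums((mu_hat - y)^2)         # deviation of responses from mu_ik

# Example group contrast (experts minus novices among US participants)
mean(calibration_score[us_expert]) - mean(calibration_score[us_novice])
```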

Fig. 5

Estimated calibration curves for all 94 participants, plotted separately by expertise status and geographical location. The shading of each curve represents the estimated precision, with darker curves corresponding to more precise responders

Extracting wisdom from the forensic crowd

A natural question to ask of our data is whether there is a “wisdom of the crowd” effect (Surowiecki, 2005), in which it might be possible to aggregate the predictions of multiple participants to produce better judgments on the whole. Our sample includes a small number of real-world experts and a much larger number of novices, and as Fig. 5 illustrates, they differ dramatically in calibration and precision. Is it possible to aggregate the responses of these participants in a way that produces more accurate predictions than any individual expert? Does the inclusion of many poorly calibrated and imprecise novices hurt the performance of an expert aggregation model? We turn now to these questions.

Figure 6 plots the performance of three different methods of aggregating participant responses, each applied to three different versions of the data: one that included all 94 participants, one that used responses from the 17 experts, and a third that used only the 8 US-based experts. Using the averaging method, the crowd prediction is the average of the individual predictions. As the figure illustrates, this method does not yield a wisdom-of-crowds effect when the novices are included in the data: using the average response of all 94 participants yields an average prediction error of 18.6%, slightly worse than the best individual participant in Fig. 3, whose error was 18.5%. Once the novices are removed, the prediction error falls to 16.8% (all experts) or 16.6% (US experts only). As the left panels of Fig. 6 illustrate, a decrease of 2% is about the same size as the difference between the best expert (18.5% error) and the median expert (20.4%).
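In code, the averaging method amounts to taking column means of the response matrix. The sketch below assumes a hypothetical participants-by-features matrix `responses`, a vector `truth` of NIJ midpoints (both on the 0-100 scale), and a hypothetical index `expert_rows` identifying the expert participants.

```r
# Averaging aggregation (assumed objects: responses, truth, expert_rows)
crowd_mean <- colMeans(responses)
mean(abs(crowd_mean - truth))                            # all participants
mean(abs(colMeans(responses[expert_rows, ]) - truth))    # experts only
```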

Fig. 6

Performance of the aggregation model (far right), compared to simpler aggregation approaches that average responses (middle) or take the median response (near right). For comparison purposes, these are plotted against the performance of the best individual participant (near left) and the median performance of all respondents (far left). Within each panel, three versions are plotted: one that included all 94 participants, one that used responses from the 17 experts, and a third that used only the 8 US-based experts

One potentially problematic issue with the averaging approach is that it only works when the very poorly performing novices are removed from the data set—in many real-world situations it is not known a priori who is truly expert and who is not. We consider two methods for addressing this: the first is to rely on a robust estimator of central tendency such as the median, instead of using the arithmetic mean. As shown in Fig. 6, aggregation via the median produces better performance regardless of whether US experts (16.4%), all experts (16.1%) or all participants (15.7%) are included. The second method is to adopt a hierarchical Bayesian approach based on Lee and Danileiko (2014), using the more general calibration functions discussed in the previous section. The advantage of this approach is that it automatically estimates the response precision for each person, and learns in an unsupervised way which participants to weight most highly. As illustrated in the far right of Fig. 6, this model makes better predictions than the averaging method or the median response method: using only the US experts the prediction error is 15.1%, which falls slightly to 14.9% when all experts are used, and improves slightly further to 14.7% when the novices are included. In other words, not only is the hierarchical Bayes method robust to the presence of novices, it is able to use their predictions to perform better than it does if only the experts are considered and yields better estimates than simple robust methods such as median response.
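Continuing the sketch above, median aggregation simply replaces the column means with column medians, while the model-based aggregation would treat the true probabilities as unknown quantities in the JAGS model (e.g., giving each \(x_{k}\) a uniform prior) and take their posterior means as the crowd estimate; this conveys the general approach rather than our exact implementation.

```r
# Median aggregation (same assumed objects as the previous sketch)
crowd_median <- apply(responses, 2, median)
mean(abs(crowd_median - truth))

# Model-based aggregation sketch: in the JAGS model, replace the data vector
# x with a latent quantity, e.g.  x[k] ~ dunif(0, 1),  and use the posterior
# mean of x[k] as the aggregated estimate for feature k.
```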

Finally, to understand why the Bayesian aggregation model performs better than the simpler averaging method, Fig. 7 plots the model-based estimates against the average human response (left) and compares both of these prediction methods to the ground truth estimated from the NIJ data (middle and right panels). As shown in the left panel, the main thing that the model has learned is to transform the average estimates via a non-linear calibration function that closely resembles curves used in standard theoretical models (Prelec, 1998), though as illustrated in Fig. 4 individual subject calibration curves often depart quite substantially from this shape. This has differential effects depending on the true frequency of the handwriting feature in question: for very rare and very common features, the model reverses the regression to the mean effect, and so the Bayesian model produces much better estimates for these items (see also Satopää et al., 2014). The overall result is that the model estimates produce a much better prediction about the ground truth (middle panel) than the averaging method (right panel).

Fig. 7

Aggregating expert predictions. Pairwise comparisons between the ground truth estimated from the NIJ data, the average response of human participants, and the estimates extracted via the hierarchical Bayesian model

Conclusions

The assessment of expert performance is a problem of considerable importance. Besides the obvious theoretical questions pertaining to the nature of expertise (e.g., Weiss & Shanteau, 2003; Ericsson & Lehmann, 1996), a variety of legal, professional and industrial fields are substantially reliant on human expert judgment. For the specific domain of forensic handwriting examination, the legal implications are the most pressing. In this respect, our findings are mixed.

On the one hand, there is some evidence that handwriting experts can estimate the frequency of occurrence of handwriting features better than novices can. On the other hand, even the single best-performing participant produced an average deviation of 18.5% from the true value. That said, this number is considerably lower than the error that would be expected by chance (25%) if people possessed no relevant knowledge and simply responded with 50% on every trial, and using modern Bayesian methods to aggregate predictions across participants reduces this error further, to 14.7%.

In short, these results provide some of the first evidence of naturalistic visual statistical learning in the context of forensic feature comparison. Nevertheless, we suggest that a cautious approach should be taken before endorsing the use of experience-based likelihood ratios for forensic purposes.