Consistency plots: a simple graphical tool for investigating agreement in key comparisons

A simple graphical display is described for investigating agreement among interlaboratory data with reported uncertainties. The plot consists of a measure of agreement—the significance level of a test for significant pairwise difference, adjusted for multiple comparisons—plotted as an image in which significance is represented by colour or intensity. This provides an easily interpretable graphical presentation in which the degree of consensus can be judged and in which anomalies are easily visible. The construction of the plot is discussed, with attention to the choice of adjustment for the size of the data set in checking for anomalies. The advantage over visual inspection using error bars is discussed, and some examples from metrology comparisons are presented.


Introduction
Many inter-laboratory metrology comparisons, including many of the key comparisons required by the CIPM MRA [1,2], generate data in the form of reported results with stated uncertainties. A critical early step in the assessment of such data is the identification of anomalies [3]. This is essential to help identify and resolve possible technical issues before any summary or performance statistics are calculated.
Graphical inspection is widely recommended for initial data inspection, not least because "there can be surprises in the data" [4, p. 8]. However, the most common plot of key comparison data, the familiar dot-and-bar plot, is not a reliable indicator of mutual consistency or of the validity of uncertainty estimates. Other useful graphical methods have been suggested; for example, Duewer's Concordance/Apparent Precision plot [5] and the Naji and (recently) Naji 2 plots [6][7][8]. These, however, rely on a reasonably reliable reference value. Many such studies do not have the advantage of an independently determined reference value with small uncertainty, leaving the choice of reference value open to debate and compromising graphical assessments that rely on a central location. This makes it important to seek alternative graphical tools that do not depend heavily on a reliable reference value.
Some procedures have been suggested that address the lack of a reliable reference value by reverting to pairwise comparisons (see, for example, [9,10]), and some can provide graphical output. A plot of the median scaled difference, for example, can provide a useful indicator of location/uncertainty anomalies without any general location estimate [10]. It can, however, be helpful to understand the structure and grouping of results in more detail, and for that purpose a richer presentation is needed.
In this paper, the shortcomings of the dot-and-bar plot as an indicator of significant disagreement are reviewed, before offering an alternative graphical indicator of pairwise inconsistency that allows quick and reliable visual identification of important pairwise differences.

Methods
All statistical operations used the R statistical programming environment, version 4.1.3 [11]. Comparison plots used the metRology package for R, version 0.9-28-1 [12].

A critique of the dot-and-bar plot
The usual tool for displaying raw key comparison data is the dot-and-bar graph (sometimes, particularly when the quantity axis is horizontal, also called a 'caterpillar plot'). An example of a typical dot-and-bar chart is shown in Fig. 1.
The plot consists of individual results and associated uncertainty intervals. A central value is often included in the plot, which may have its own uncertainty (usually shown with an additional pair of lines bracketing the central line). This type of plot neatly displays the reported data, but as a tool for identifying inconsistencies it falls short in several respects.
First, the plot is sometimes constructed using standard uncertainties for the error bars. Although this accurately displays the standard uncertainties, it gives no useful visual indication of the significance of any differences.
Second, visual comparison of plotted uncertainty intervals is often ambiguous. While cases (a) and (b) in Fig. 2 can always be taken as indicating a lack of significant difference at the chosen confidence level, cases (c) and (d) are less clear. For just two observations, cases like (c) and (d) may indicate significance anywhere between about 84 % and 99.4 %. These levels arise, respectively, when both observations are just outside each other's intervals, and when the intervals are just in contact (case (d) in Fig. 2), both with u₁ = u₂. Nor is it straightforward to recognise the degree of overlap required for a chosen level of significance, as that depends on the relative sizes of the two uncertainties. As Fig. 3 shows, even for the simple case of only two observations, exact 95 % significance occurs with quite different extents of overlap, depending on the two uncertainties involved.
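The two bounds quoted above can be checked directly. The following Python sketch (purely illustrative; the function name is not from any published implementation) computes the two-sided significance level of a z-test for two results with equal standard uncertainties, at the two extremes: each value just outside the other's expanded (k = 1.96) interval, and the two expanded intervals just touching:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def significance(d, u1, u2):
    """Two-sided significance level (as a fraction) of a z-test for a
    difference d between results with standard uncertainties u1, u2."""
    z = abs(d) / sqrt(u1 ** 2 + u2 ** 2)
    return 1.0 - 2.0 * (1.0 - phi(z))

k = 1.96  # coverage factor for ~95 % expanded uncertainty
u = 1.0   # equal standard uncertainties

# each value just outside the other's expanded interval (d = k*u)
print(round(significance(k * u, u, u), 3))      # 0.834

# the two expanded intervals just in contact (d = 2*k*u)
print(round(significance(2 * k * u, u, u), 3))  # 0.994
```

The first case reproduces the approximately 84 % figure, the second the 99.4 % figure, for equal uncertainties.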
Third, visual inspection of a dot-and-bar plot for differences implicitly seeks differences among all possible pairwise comparisons; it is, implicitly, conducting multiple simultaneous hypothesis tests. For a study with n participants, there are n(n − 1)∕2 distinct pairwise comparisons. This is a well-known problem in general statistics. D'Errico [14] provides a useful review of the multiple comparison problem in the context of metrology; Elster and Toman [15] allude to it in the context of key comparison data evaluation.

Briefly, multiple pairwise comparisons drastically increase the probability of type 1 error, usually known in this context as the family-wise type 1 error rate (FWER). Type 1 error is incorrect rejection of a null hypothesis, that is, a 'false positive'. In the metrology comparison context, H0 corresponds to the hypothesis of no difference in (true) mean values. The FWER is the probability of incorrect rejection of this general null hypothesis, based on one or more apparently significant findings among a 'family' of multiple pairwise tests. The FWER for an uncorrected multiple pairwise test increases rapidly with the number of laboratories. In a ten-lab study where laboratories are performing identically, at least one apparently significant difference can be expected by chance if all pairwise tests are carried out at the usual 95 % level. This leads to incorrect conclusions and wasted time following up false positives. Over-interpretation because of implicit multiple comparisons, and under-interpretation because of the focus on interval overlap, can partly cancel in modest studies. But the actual false positive rate still depends on the size of the study. In very small studies, visual inspection of the dot-and-bar plot underestimates the significance of differences, and for larger studies (ten or more) it quickly starts to overstate significance owing to the multiple comparison problem.
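The growth of the family-wise error rate with study size is easy to illustrate. If, for simplicity, the m = n(n − 1)/2 pairwise tests at level α are treated as independent (they are in fact correlated, since they share data, so this is only indicative), the FWER is 1 − (1 − α)^m:

```python
def fwer(n_labs, alpha=0.05):
    """Approximate family-wise error rate for all pairwise tests,
    assuming the m = n(n-1)/2 tests are independent."""
    m = n_labs * (n_labs - 1) // 2
    return 1.0 - (1.0 - alpha) ** m

for n in (5, 10, 15):
    print(f"{n} labs: FWER about {fwer(n):.3f}")
```

For ten laboratories this gives a FWER of about 0.90, consistent with the expectation of at least one false positive by chance alone.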
Finally, the standard dot-and-bar plot may need careful and lengthy inspection to identify all the anomalous results. Even when a reliable independent reference value is available, overlap between individual expanded uncertainty intervals and the corresponding interval for the key comparison reference value (KCRV) suffers exactly the same interpretation ambiguities as shown in Figs. 2 and 3. Without such a reference value, the viewer must visually compare each laboratory result with all of the others in turn, taking account of the ambiguities above, to identify possible concerns.
The dot-and-bar plot, then, remains a generally useful tool for presenting metrology comparison data, but it is not easily interpreted in marginal cases. For simple pairwise comparisons, it encourages under-interpretation of differences; for detecting general inconsistency, it can under- or over-interpret, depending on the size of the study and the actual inspection criterion used.

The pairwise consistency plot
One solution to the problems above is the 'consistency plot' proposed here. The plot consists of the p values for all tests for significant pairwise difference, adjusted for multiple comparisons, plotted as an image in which the degree of disagreement is represented by colour or colour intensity. This provides a very quick and reliable visual indication of mutual inconsistencies among results with associated uncertainties.
By way of example, consider the data displayed in Fig. 1. The task is, first, to check whether the reported uncertainties explain the observed dispersion and, second, to determine which results might be responsible for any apparent overdispersion.
A commonly recommended check for whether key comparison results are mutually consistent within their uncertainties is a Chi-squared test, based on differences between some central estimate and each result, each divided by the uncertainty of the difference [16]. For the data in Fig. 1, even omitting the obvious low outlier at 2.13 mol kg⁻¹, a Chi-squared test with the usual weighted mean as the central estimate returns a Chi-squared value of 167 for 12 degrees of freedom and a corresponding p value of 1.61 × 10⁻²⁹: convincing evidence of appreciable overdispersion. While this provides a convenient summary test, however, it tells us nothing about which results are causing any excess dispersion. Inspection of Fig. 1 suggests at least laboratories 13 and 14 (as well as laboratory 2), but others also seem inconsistent with the provisional location, if not necessarily with one another. Checking whether all are singly or jointly responsible for the high Chi-squared value is a slow process. Instead, this more detailed question can be addressed by a multiple pairwise test; that is, each result is tested against every other result for a significant difference, with adjustment for multiple comparisons.
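A minimal Python sketch of this consistency check, using the weighted mean as the central estimate (function and variable names here are illustrative, not from the software cited above):

```python
def chi_squared_consistency(x, u):
    """Observed chi-squared value and degrees of freedom for a
    mutual-consistency check of results x with standard uncertainties u,
    using the weighted mean as the central estimate."""
    w = [1.0 / ui ** 2 for ui in u]
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sum(w)
    chi_sq = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    return chi_sq, len(x) - 1

# toy example: three results with equal standard uncertainties
chi_sq, dof = chi_squared_consistency([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
print(chi_sq, dof)  # 2.0 2
```

The p value then follows from the upper tail of the Chi-squared distribution with the returned degrees of freedom (for example via scipy.stats.chi2.sf).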
For most purposes in metrology, a z- or t-test is often sufficient for comparing a pair of observations with uncertainty. For independent results x₁ and x₂ with uncertainties u(x₁) and u(x₂), respectively, the usual t- or z-tests calculate the test statistic

z = (x₁ − x₂) / √(u²(x₁) + u²(x₂))

The uncertainty term in the denominator can be adjusted for correlation if needed. The statistic can be interpreted either by comparison with an appropriate critical value or by inspecting a p value derived from the normal or t distribution, depending on whether degrees of freedom are important. With appropriate adjustment for the number of pairwise comparisons, this not only provides a test for inconsistency; the list of test statistics or, equivalently, their p values, shows which pairs of results appear to show significant difference.
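As a sketch, the pairwise z-test just described might be implemented as follows (Python, standard library only; the function name is illustrative):

```python
from math import erf, sqrt

def pairwise_z_test(x1, u1, x2, u2):
    """Two-sided z-test for a difference between two independent results.

    Returns the test statistic and its p value from the standard
    normal distribution."""
    z = (x1 - x2) / sqrt(u1 ** 2 + u2 ** 2)
    p = 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2.0))))
    return z, p

# hypothetical pair of results with equal standard uncertainties
z, p = pairwise_z_test(10.20, 0.05, 10.00, 0.05)
print(round(z, 2), round(p, 4))
```

A t-based variant would replace the normal tail probability with one from the t distribution with an effective number of degrees of freedom.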
The consistency plot proposed here is simply a plot of these p values, adjusted for multiple comparisons, with colour or colour intensity chosen for familiar thresholds of significance. The consistency plot for the data in Fig. 1 (data in table S1, supplementary information) is shown in Fig. 4. Interpretation is simple: a strong colour at the intersection for two laboratories (laboratories appear along both x- and y-axes) indicates a significant difference between the two, in this case after Holm adjustment for multiple tests [17]. The colour density indicates the significance level according to the associated key; absence of colour indicates no significant difference. For this data set, it is not surprising to see that the lowest result, indicated as extreme in Fig. 1, is inconsistent with all other results reported; this is shown by the essentially continuous band of strongly significant results for Lab02 along the left and lower edges of the plot. The two highest results, for laboratories 13 and 14, are also seen as clearly inconsistent with most other reported results, as might be expected from inspection of Fig. 1. However, there is more detail in the consistency plot. The results for laboratories 2 and 3 show additional significant pairwise differences for four laboratories (Lab11-Lab14) in addition to Lab02. The small uncertainty for the result from Lab11 leads to five additional strongly significant differences and two less compelling differences. Finally, the result for Lab10 shows marginally significant differences with two other laboratories. Overall, the plot clearly identifies multiple issues that suggest quite general inconsistency, with considerable detail to support follow-up investigation.
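The core computation behind such a plot can be sketched as follows: build the symmetric matrix of pairwise z-test p values, then adjust for multiple comparisons before mapping to colour. This Python sketch (standard library only, names illustrative; the rendering step, for example with matplotlib's imshow, is omitted) uses a Holm adjustment:

```python
from math import erf, sqrt

def p_matrix(x, u):
    """Symmetric matrix of Holm-adjusted two-sided z-test p values
    for all pairwise differences among results x with uncertainties u."""
    n = len(x)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    # raw two-sided p values from pairwise z-tests
    raw = []
    for i, j in pairs:
        z = abs(x[i] - x[j]) / sqrt(u[i] ** 2 + u[j] ** 2)
        raw.append(2.0 * (1.0 - 0.5 * (1.0 + erf(z / sqrt(2.0)))))
    # Holm step-down adjustment of the raw p values
    m = len(raw)
    order = sorted(range(m), key=lambda k: raw[k])
    adj, running = [0.0] * m, 0.0
    for rank, k in enumerate(order):
        running = max(running, min(1.0, (m - rank) * raw[k]))
        adj[k] = running
    # fill the symmetric matrix; the diagonal (self-comparison) stays at 1
    mat = [[1.0] * n for _ in range(n)]
    for (i, j), p in zip(pairs, adj):
        mat[i][j] = mat[j][i] = p
    return mat
```

Each cell below a chosen threshold (0.05, 0.01, and so on) would then be rendered with a correspondingly stronger colour intensity.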
Some general features of the plot are also worth comment. First, the plot is diagonally symmetric, as laboratories are plotted in the same order on both axes and the two-sided test is symmetric, in the sense that the test used here returns the same p value irrespective of the sign of a difference between results. Second, the leading diagonal (from lower left to top right, here) will always be consistently clear, as this would represent a laboratory result tested against itself. It is of course possible to mark the plot to show these as an exception; here, no special indication is given, in order to retain the focus on discrepancies.
Third, in the plot shown here, laboratories have been shown in order of ascending measured value. This accounts for the typically empty central region, in which eight or nine laboratories show no significant pairwise difference in this study. This makes it relatively simple to identify what appears to be a mutually consistent subset of results. It can also suggest the presence of multiple such subsets. For example, the empty region at top right of Fig. 4 might suggest a small but mutually consistent subset of three laboratories, though in this case it would be prudent to note the relatively large differences and uncertainties among those results. A different ordering could, however, change the plot appearance considerably and might offer advantages in some cases; the effect of ordering is discussed below, in connection with the example from CCQM-K86c.
Finally, with a Holm adjustment, the appearance of any significant difference in the plot is cause to reject a null hypothesis of equal means and accurate uncertainties. This follows because a Holm adjustment strongly preserves the FWER [17]; at the null, sampled data sets will generate one or more significant Holm-corrected p values with essentially the same frequency as a Chi-squared test will show significant inconsistency. The method of p value adjustment is, however, an important choice and is discussed further below.

Choice of p value adjustment for multiple hypothesis tests
Multiple test p value adjustment procedures all, to some extent, increase the individual pairwise test p values by an amount dependent on the intended outcome [18]. Adjustment methods can have different objectives. The most common is to preserve the FWER. For example, the simple and well-known Bonferroni correction for multiple pairwise hypothesis tests [19], implemented as a p value adjustment, simply requires multiplication of all p values by the number of pairwise tests, capping the returned p values at 1.0. To a good approximation this maintains the FWER of (say) a Chi-squared test on the complete data set. However, it will often do so at the expense of missing some marginally significant pairwise differences that might individually be worth closer inspection. As the data set size increases, this can quickly become problematic for a study organiser wishing to quickly identify multiple potentially anomalous points for closer inspection. Choice of p value adjustment method is therefore important.
Fig. 4 Consistency plot for the data in Fig. 1. Squares show the p value for a simple pairwise z-test between the two laboratories (identified by axis labels) represented by the intersection. The key shows the colour intensity for the resulting p value. For this plot, a Holm adjustment for multiple pairwise tests was used.
Figure 5 illustrates the effect of several well-known adjustment methods. Two of the procedures shown, the Bonferroni and Holm adjustments, strongly preserve FWER. Of the two, the Bonferroni adjustment (Fig. 5a) is simplest but least powerful for identifying real pairwise differences. As the figure shows, the Bonferroni adjustment generates the smallest number of apparently important pairwise differences. Holm's method (Fig. 5b) is more powerful for identifying important pairwise differences, but still strongly preserves FWER; it is consequently to be recommended over Bonferroni's method for most purposes.
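For concreteness, the two FWER-preserving adjustments can be sketched in a few lines of Python (these illustrative implementations mirror the behaviour of R's p.adjust with methods "bonferroni" and "holm"):

```python
def bonferroni(pvals):
    """Bonferroni adjustment: multiply each p value by the number of
    tests and cap at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def holm(pvals):
    """Holm step-down adjustment: multiply the k-th smallest p value
    by (m - k + 1), enforce monotonicity, and cap at 1."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adj, running = [0.0] * m, 0.0
    for rank, i in enumerate(order):
        running = max(running, min(1.0, (m - rank) * pvals[i]))
        adj[i] = running
    return adj

print(bonferroni([0.01, 0.04, 0.03]))  # approx [0.03, 0.12, 0.09]
print(holm([0.01, 0.04, 0.03]))        # approx [0.03, 0.06, 0.06]
```

Note how Holm leaves the larger p values less inflated than Bonferroni, which is why it is the more powerful of the two while still strongly controlling the FWER.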
An alternative objective for adjustment, however, is to balance high power for detection of real pairwise differences with good control of the FWER at the null. A widely used procedure that achieves this is that of Benjamini and Hochberg [20]. Instead of the FWER, this procedure controls the false discovery rate; that is, the expected proportion of apparently 'significant' differences that are in fact due to chance. At the null it provides good FWER control, but it also correctly identifies a higher proportion of real differences. For real differences it has consistently higher power than Holm's procedure [20]. As Fig. 5c) shows, this generates more potentially important differences among the cadmium data results. Where the principal intention is to identify possible differences for closer investigation, then, the Benjamini-Hochberg procedure should be preferred.
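The Benjamini-Hochberg step-up adjustment can be sketched similarly (again an illustrative implementation mirroring R's p.adjust(..., method = "BH")):

```python
def benjamini_hochberg(pvals):
    """Benjamini-Hochberg step-up adjustment, controlling the false
    discovery rate: the k-th smallest p value is multiplied by m/k,
    with monotonicity enforced from the largest p value downwards."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)
    adj, running = [0.0] * m, 1.0
    for step, i in enumerate(order):
        k = m - step  # 1-based rank of pvals[i] from the smallest
        running = min(running, pvals[i] * m / k)
        adj[i] = min(1.0, running)
    return adj

print(benjamini_hochberg([0.01, 0.04, 0.03]))  # approx [0.03, 0.04, 0.04]
```

Compared with the Holm adjustment of the same p values, the larger p values are inflated less, which is the source of the procedure's higher power for detecting real differences.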
The figure includes a comparison with a plot using no p value adjustment, for completeness (Fig. 5d). In this case, the change from the Benjamini-Hochberg adjustment is modest. This is a fair indication that most of the 'significant' differences in Fig. 5d) are genuine; recall that the Benjamini-Hochberg procedure controls the false discovery rate, so only about 5 % of 'significant' results, after adjustment, will be incorrect. In other cases, where participants perform consistently, failing to adjust will lead to an unacceptably high proportion of apparently significant differences, and follow-up will waste study coordinator and participant time.

Further examples
The utility of these consistency plots is perhaps best illustrated by example. Here, two further examples are presented. Figure 6a) shows reported results and uncertainties from a CCQM pilot study on electrical conductivity [21]; for the data, see table S2 in supplementary information. In this early study, the reported uncertainties vary considerably due to different measurement technologies in use at the time. As a result, the dot-and-bar plot in Fig. 6a) needs expansion (see inset) to inspect results for consistency and, even then, inspection remains challenging. The corresponding consistency plot in Fig. 6b), however, shows immediately that results from laboratories 2 through 6 differ significantly from those for laboratories 10 through 13, with laboratories 7 and 8 inconsistent with 11 and 13, and that laboratory 2 shows additional significant differences (adjusted p value < 0.05) compared to laboratories 5, 6, and 8. The overall picture is of agreement among results from laboratories 1, 3 through 9, and 14, but appreciable differences between the central group (3 through 9) and other results.
Figure 7a) shows results and uncertainties for one of two materials examined in key comparison CCQM-K86c, a comparison on determination of a specified gene target ratio in Brassica napus (canola) [22]; for the data, see table S3 in supplementary information. The CCQM-K86 range of studies [22][23][24] supports the quantification of the proportion of genetically modified (GMO) material present in otherwise wild-type materials. For this material, the nucleic acid target was a fragment introduced in modification event RT73, conferring glyphosate resistance. Agreement was unexpectedly poor, as can be seen in both the dot-and-bar plot and the corresponding consistency plot (Fig. 7). The example is of particular interest because the extensive disagreement visible in Fig. 7b) is associated with the particular assay used for the reference gene; four laboratories proved to have targeted the fatA(A) gene, the remainder targeting fatA. Although the assay is recommended for GMO determination in canola [25], the canola genome includes two copies of fatA but only one of the fatA(A) target. The result is effectively a different measurand, and a different copy number ratio, although a conversion is possible given information on the fatA/fatA(A) ratio in the canola genome. The consistency plot in Fig. 7b) identifies the resulting disagreement clearly, via the high proportion of low p values at top right and lower left; there is essentially no agreement across the two groups.
Fig. 7 (a) Results for sample T2 in CCQM-K86c, an interlaboratory comparison on determination of a specified gene target ratio in canola [22]. Error bars show expanded uncertainties at ±2u. The horizontal line is the median of the accepted data. Results are ordered by arbitrary laboratory identifier. (b) Consistency plot for the data in (a), using a Benjamini-Hochberg adjustment. Laboratories (identified by axis labels) are ordered by assay and then by ascending measured value. Solid rectangles group pairwise comparisons among laboratories using the same assays for the reference gene (shown in bold inside the rectangle).
The plot also confirms that the two lowest values disagree consistently with essentially all other participants, despite appreciable overlap in confidence intervals for laboratories K08 and K10. The CCQM-K86c example also illustrates the importance of an appropriate choice of data ordering in such plots. The essentially random ordering in Fig. 7a) gives no strong indication of the two, and possibly three, distinct groups which are visible in Fig. 7b). It is also worth noting that, while random ordering in a consistency plot generates easily interpretable conclusions when there are relatively few significant pairwise differences, random or arbitrary ordering can obscure important patterns when there are many significant pairwise differences. For that reason, it will often be useful to start with a plot in which laboratories are ordered by ascending or descending reported value. A variation of Fig. 7a), in ascending value order, can be found in supplementary material.

Limitations and alternatives
The pairwise consistency plot described here is much better than the dot-and-bar plot for identifying significant differences among results with uncertainties. In the form suggested here, however, it has two limitations that may sometimes be important. First, it might not scale well to much larger data sets, such as a large proficiency testing exercise with hundreds of results. Although it will continue to identify inconsistency, it will become difficult to identify the specific participants responsible. Second, although the principle of displaying adjusted p values is relatively general, generating these becomes intricate when individual reported uncertainties are associated with appreciably different distributions. For small relative uncertainties, this is rarely a serious issue, but for relative uncertainties much over 20 %, asymmetry can become important and it may not be straightforward to provide pairwise hypothesis tests for difference in location.
Scaling is rarely an issue for metrology comparisons in the CIPM framework, as most comparisons are relatively small. It is also noteworthy that the dot-and-bar plot itself becomes crowded and hard to interpret as study size increases. For much larger interlaboratory comparisons, other tools are likely to be needed.
The problem of differing uncertainty distributions is more difficult to address. In principle, the probability of a given observed difference in location between observations from almost any pair of distributions can be obtained via simulation methods. In practice, this can be intractable; for example, the variance of many non-normal distributions depends on a true location, and constructing a null hypothesis for a pairwise test can be difficult. Measures other than hypothesis test probabilities might then be more appropriate. For example, although not examined here, plots based on indicators such as Kullback-Leibler divergence [26] and related measures could prove informative. In addition, it should not be forgotten that Cofino and others have long used a distribution overlap integral as a basis for a robust estimate of location [27,28], and plots of the pairwise overlap matrix are very similar in appearance to the plots described here [29]. The principal difference is that the consistency plots in the present paper provide a direct indication of pairwise test significance, whereas overlap integrals have not so far been used for that purpose.
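For the normal case, one simple overlap measure, the integral over the real line of the product of the two results' probability densities, has a closed form. The sketch below is purely illustrative of how a pairwise overlap matrix might be populated; it is not the Cofino estimator itself, and the function name is an assumption:

```python
from math import exp, pi, sqrt

def normal_overlap(x1, u1, x2, u2):
    """Integral of the product of two normal densities with means x1, x2
    and standard deviations u1, u2; larger values mean greater overlap."""
    s2 = u1 ** 2 + u2 ** 2
    return exp(-(x1 - x2) ** 2 / (2.0 * s2)) / sqrt(2.0 * pi * s2)

# identical results overlap maximally; separated results approach zero
print(normal_overlap(10.0, 0.1, 10.0, 0.1) >
      normal_overlap(10.0, 0.1, 10.5, 0.1))  # True
```

Unlike an adjusted p value, such a measure carries no direct significance interpretation, which is the principal difference noted above.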

Conclusions
As shown above, a consistency plot of the kind suggested here is a useful addition to the range of graphical methods for inspecting key comparison and similar data. With the right choice of p value adjustment, the plot provides both a test for general inconsistency and a clear indication of the particular differences that may need to be examined more closely. It has the additional advantage of not depending on a choice of estimator for a central value, making it particularly useful in initial review of studies with no independent reference value.
An implementation of the plot is available in the metRology [12] package for R. A simple spreadsheet implementation, using a Bonferroni adjustment, is available as electronic supplementary material for MS Excel versions that support the MS Excel SORTBY function (MS 365, available from mid-2020).