Methods for evaluating the agreement between diagnostic tests
During the course of the development of a new diagnostic test, it is often necessary to compare the performance of the new test to that of an existing method. When the tests in question produce qualitative results (e.g., a test that indicates the presence or absence of a disease), the use of measures such as sensitivity/specificity or percent agreement is well established. For tests that lead to quantitative results, different methods are needed. The paper by Dunet et al published in this issue provides an example of an application of some of these methods.1
At the heart of this issue is quantifying the agreement between the results of two (or more) tests. That is, the two tests should yield similar results when applied to the same subject. Here, we consider the setting where we have a sample of subjects tested using both tests. A natural starting point in assessing agreement between quantitative results would be to consider the differences between the test results for each subject. While the paired t test could then be used to test whether the mean difference significantly differs from zero, this test cannot provide evidence that there is agreement. That is, rejecting the null hypothesis of no difference between the two sets of test results would only allow us to say that the tests do not agree; failing to reject this hypothesis would not constitute proof that the tests agree.
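The limitation of the paired t test can be seen in a small sketch. The data below are hypothetical and chosen only for illustration: the second test is off by 10 units for every subject, but the errors alternate in sign, so the mean difference is exactly zero and the t statistic gives no hint of disagreement. The function name `paired_t_statistic` is an assumption of this example, not something taken from the paper.

```python
import math
import statistics


def paired_t_statistic(test_a, test_b):
    """Return (mean difference, t statistic) for paired test results."""
    diffs = [a - b for a, b in zip(test_a, test_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD, n - 1 denominator
    t_stat = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return mean_diff, t_stat


# Hypothetical results: every pair differs by 10 units,
# but the sign of the difference alternates between subjects.
test_a = [50, 60, 70, 80, 90, 100]
test_b = [60, 50, 80, 70, 100, 90]

mean_diff, t_stat = paired_t_statistic(test_a, test_b)
abs_diffs = [abs(a - b) for a, b in zip(test_a, test_b)]

print("mean difference:", mean_diff)
print("t statistic:", t_stat)
print("absolute differences:", abs_diffs)
```

With a t statistic of zero the null hypothesis is not rejected, yet no subject has results closer than 10 units apart, which is why a non-significant result cannot be read as evidence of agreement.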
Another way to assess whether two tests are in agreement, this time visually, is to construct a scatterplot of the results from the first test against the results from the second test. If the two tests have good agreement, we should expect the points to fall on or near the 45° (i.e., y = x) line; departures from this line would indicate poor agreement. Although the Pearson correlation coefficient, ρ, may be used to assess the strength of a linear relationship between the results of two tests, it is not an appropriate means of assessing agreement: while it is true that the results from tests that agree well will have a high Pearson correlation, the converse is not always true, as will be illustrated later.
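A minimal sketch of why a high Pearson correlation does not imply agreement is given here, using hypothetical data in which the second test reports exactly twice the first test's value plus 5. The points lie on a straight line, so the correlation is 1, but that line is not the y = x line. The helper `pearson_r` is written out by hand for this example.

```python
import math


def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical results: test 2 is a perfect linear function of test 1,
# but it never reports a value close to test 1's.
test_1 = [10.0, 20.0, 30.0, 40.0, 50.0]
test_2 = [2 * v + 5 for v in test_1]

r = pearson_r(test_1, test_2)
smallest_gap = min(abs(a - b) for a, b in zip(test_1, test_2))

print("Pearson r:", r)
print("smallest difference between paired results:", smallest_gap)
```

Here the correlation is perfect while the two tests never come within 15 units of each other, so the scatterplot would show a tight line lying well away from the 45° line.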
The intraclass correlation coefficient (ICC) ranges from 0 (no agreement) to 1 (perfect agreement).
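As a rough illustration, the sketch below estimates an ICC from a one-way ANOVA decomposition, using the commonly cited form (MSB - MSW) / (MSB + (k - 1) MSW), where MSB and MSW are the between-subject and within-subject mean squares and k is the number of measurements per subject. This is assumed to correspond to the one-way version discussed in the text; the function name and the data are hypothetical.

```python
def icc_one_way(ratings):
    """One-way ANOVA ICC estimate.

    ratings: list of per-subject lists, each holding k measurements
    (for example, one result from each of k tests).
    """
    n = len(ratings)
    k = len(ratings[0])
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    subject_means = [sum(row) / k for row in ratings]

    # Between-subject and within-subject mean squares.
    ms_between = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    ms_within = sum(
        (v - m) ** 2 for row, m in zip(ratings, subject_means) for v in row
    ) / (n * (k - 1))

    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical data: three subjects, two tests each.
identical = [[10, 10], [20, 20], [30, 30]]
close = [[10, 11], [20, 19], [30, 31]]

icc_identical = icc_one_way(identical)
icc_close = icc_one_way(close)

print("ICC, identical results:", icc_identical)
print("ICC, nearly identical results:", icc_close)
```

When the two tests give identical results the within-subject mean square is zero and the estimate equals 1; small discrepancies relative to the spread between subjects pull it slightly below 1.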
The ICC has since been extended for use in a variety of settings. Bartko5 proposed using a two-way ANOVA model to account for rater effects, which can be either fixed (for a finite set of raters) or random (useful when the raters are selected from a larger pool). The model can also be extended to accommodate replicated measurements and/or subject by rater interactions. Different versions of the ICC have been defined for each of these varying ANOVA models. Shrout and Fleiss6 provide a useful discussion of the different forms of the ICC (see also McGraw and Wong7,8).
Note that if one of the tests is a reference or “gold standard,” then the bias is based on the difference between the new test’s result and the “true value” of the quantity being measured, and is hence a measure of accuracy.10 For these cases, the concordance correlation coefficient (CCC) can be said to measure accuracy as well as consistency. But when neither test is a gold standard, it is not appropriate to state that the CCC also provides a measure of accuracy.
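The way the concordance correlation coefficient (CCC) responds to bias can be sketched with the usual sample form of Lin's coefficient, 2 s_xy / (s_x^2 + s_y^2 + (mean_x - mean_y)^2), computed here with 1/n moments. The function name and the data are assumptions made for this example; the hypothetical new test reads exactly 5 units higher than the reference for every subject.

```python
def concordance_cc(x, y):
    """Sample concordance correlation coefficient (1/n moments)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    # The squared mean difference in the denominator penalizes bias.
    return 2 * cov_xy / (var_x + var_y + (mean_x - mean_y) ** 2)


# Hypothetical data: the new test is shifted by a constant 5 units.
reference = [10.0, 20.0, 30.0, 40.0, 50.0]
new_test = [v + 5 for v in reference]

ccc_same = concordance_cc(reference, reference)
ccc_shifted = concordance_cc(reference, new_test)

print("CCC, identical results:", ccc_same)
print("CCC, constant shift of 5:", ccc_shifted)
```

The shifted results are perfectly correlated with the reference, but the constant offset lowers the CCC below 1. Whether that offset should be called inaccuracy depends, as noted above, on whether one of the tests is a gold standard.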
As with the ICC, the CCC can be modified to handle replications or repeated measurements.11-13 Carrasco and Jover14 have shown that, in the case of no replications, the CCC is identical to an ICC defined using a two-way ANOVA model rather than a one-way ANOVA (see also Nickerson15). A detailed accounting of the different versions of the CCC and ICC may be found in Chen and Barnhart.12,13,16
In their paper, Dunet et al1 use both Bland and Altman’s limits of agreement and Lin’s concordance correlation coefficient to assess the agreement between software packages. These two methods provide complementary pieces of information. The limits of agreement are useful for determining, when test results differ, whether those differences are likely to be clinically significant; use of the CCC yields a concise summary of the consistency and bias.
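For readers who want to see the arithmetic, the following sketch computes limits of agreement in the usual Bland and Altman form, the mean difference plus or minus 1.96 standard deviations of the differences. The data, the function name, and the clinical tolerance of 5 units are all hypothetical and serve only to show how the limits would be compared against a clinically meaningful threshold.

```python
import statistics


def limits_of_agreement(test_a, test_b, z=1.96):
    """Return (lower limit, mean difference, upper limit)."""
    diffs = [a - b for a, b in zip(test_a, test_b)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample SD of the paired differences
    return bias - z * sd, bias, bias + z * sd


# Hypothetical paired results from two tests on six subjects.
test_a = [102, 98, 110, 95, 105, 100]
test_b = [100, 99, 107, 97, 103, 101]

lower, bias, upper = limits_of_agreement(test_a, test_b)

# Hypothetical clinical tolerance: differences within 5 units do not matter.
tolerance = 5
within_tolerance = -tolerance <= lower and upper <= tolerance

print("mean difference:", bias)
print("limits of agreement:", round(lower, 2), "to", round(upper, 2))
print("limits fall inside the tolerance:", within_tolerance)
```

If the limits lie inside the range that clinicians consider unimportant, most differences between the two tests are unlikely to change a clinical decision; the tolerance itself has to come from subject-matter knowledge rather than from the data.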
The authors have no conflicts of interest to disclose.
- 4. Fisher RA. Statistical methods for research workers. London: Oliver and Boyd; 1925.
- 10. US Food and Drug Administration. Guidance for industry: Bioanalytical method validation; 2001. http://www.fda.gov/downloads/Drugs/Guidances/ucm070107.pdf.