What is Meant by “Imperfect Reference Standard” and Why is it Important for Meta-Analysis and Synthesis in General?
Perhaps the simplest case of test performance evaluation includes an “index test” and a reference test (“reference standard”) whose results are dichotomous in nature (or are made dichotomous). Both tests are used to inform on the presence or absence of the condition of interest, or to predict the occurrence of a future event. For the vast majority of medical tests, both the results of the index test and the reference test can differ from the true status of the condition of interest. Figure 1 shows the correspondence between the true 2 × 2 table probabilities (proportions) and the eight strata defined by the combinations of index and reference test results and the presence or absence of the condition of interest. These 8 probabilities (α₁, β₁, γ₁, δ₁, α₂, β₂, γ₂ and δ₂) are not known, and have to be estimated from the data (from studies of diagnostic or prognostic accuracy). More accurately, a study of diagnostic accuracy tries to estimate quantities that are functions of the eight probabilities.
Diagnostic Accuracy—the Case of the “Perfect” Reference Standard
A “perfect” reference standard would be infallible, always matching the condition of interest, and, thus, in Figure 1 the proportions in the grey boxes (α₂, β₂, γ₂ and δ₂) would be zero. The data in the 2 × 2 table are then sufficient to estimate the four remaining probabilities (α₁, β₁, γ₁, and δ₁). Because the four probabilities necessarily sum to 1, it is sufficient to estimate any three. In practice, one estimates three other parameters, which are functions of the probabilities in the cells, namely, the sensitivity and specificity of the index test and the prevalence of the condition of interest (Table 1). If the counts in the 2 × 2 table are available (e.g., from a cohort study assessing the index test’s performance), one can estimate the sensitivity and the specificity of the index test in a straightforward manner: \( \widehat{Se_{index}} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} \), and \( \widehat{Sp_{index}} = \frac{\text{true negatives}}{\text{true negatives} + \text{false positives}} \), respectively.
Table 1 Parameterization When the Reference Standard is Assumed “Perfect” (“Gold Standard”)
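The formulas above translate directly into code. The sketch below uses made-up counts (not from any study) to show the computation:

```python
# Naive sensitivity and specificity of the index test from a 2x2 table,
# treating the reference standard as perfect ("gold standard").
# The counts are hypothetical, for illustration only.
def naive_accuracy(tp, fp, fn, tn):
    """Return (sensitivity, specificity) of the index test."""
    se = tp / (tp + fn)  # true positives / all with the condition
    sp = tn / (tn + fp)  # true negatives / all without the condition
    return se, sp

se, sp = naive_accuracy(tp=90, fp=15, fn=10, tn=135)
print(round(se, 3), round(sp, 3))  # 0.9 0.9
```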
Diagnostic Accuracy—the Case of the “Imperfect” Reference Standard
Only rarely are we sure that the reference standard is a perfect reflection of the truth. Most often in our assessments we accept some degree of misclassification by the reference standard, implicitly accepting it as being “as good as it gets”. Table 2 lists some situations where we might question the validity of the reference standard. Unfortunately, there are no hard and fast rules for judging the adequacy of the reference standard; systematic reviewers should consult content experts in making such judgments.
Table 2 Situations Where One Can Question the Validity of the Reference Standard
Table 3 shows the relationship between the sensitivity and specificity of the index and reference tests and the prevalence of the condition of interest when the results of the index and reference tests are independent among those with and without the condition of interest (“conditional independence”, one of several possibilities). For conditionally independent tests, estimates of sensitivity and specificity from the standard formulas (“naïve estimates”) are always smaller than the true values (see the example in Fig. 2, and the more detailed discussion later).
Table 3 Parameterization When the Reference Test is Assumed to be Imperfect, and the Index and Reference Test Results are Assumed Independent within the Strata of the Condition of Interest
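The downward bias under conditional independence can be verified numerically. The sketch below uses hypothetical values for the prevalence and the accuracies of both tests (not taken from any study) and computes the naïve estimates from the cell probabilities:

```python
# Naive sensitivity/specificity of the index test when the reference
# standard is imperfect and the two tests are conditionally independent.
# All parameter values are hypothetical, for illustration only.
prev = 0.3                   # prevalence of the condition
se_idx, sp_idx = 0.9, 0.8    # true accuracy of the index test
se_ref, sp_ref = 0.95, 0.9   # true accuracy of the imperfect reference

# Joint cell probabilities under conditional independence
p_pos_pos = prev * se_idx * se_ref + (1 - prev) * (1 - sp_idx) * (1 - sp_ref)
p_neg_neg = prev * (1 - se_idx) * (1 - se_ref) + (1 - prev) * sp_idx * sp_ref
p_ref_pos = prev * se_ref + (1 - prev) * (1 - sp_ref)

naive_se = p_pos_pos / p_ref_pos        # P(index+ | reference+)
naive_sp = p_neg_neg / (1 - p_ref_pos)  # P(index- | reference-)

# Both naive estimates fall below the true values of 0.9 and 0.8
print(round(naive_se, 3), round(naive_sp, 3))  # 0.762 0.784
```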
Options for Systematic Reviewers
So how should one approach the challenge of synthesizing information on diagnostic or prognostic tests when the purported “reference standard” is judged to be inadequate? At least four options exist. The first two change the framing of the problem, and forgo the classical paradigm for evaluating test performance. The third and fourth work within the classical paradigm and rely on qualifying the interpretation of results, or on mathematical adjustments:
-
1.
Forgo the classical paradigm; assess the index test’s ability to predict patient relevant outcomes instead of test accuracy (i.e., treat the index test as a predictive instrument).5,6 This reframing applies when outcome information (usually on long term outcomes) exists, and the measured patient outcomes are themselves valid. If so, the approach to such a review is detailed in Chapter 11 in this supplement of the Journal.7
-
2.
Forgo the classical paradigm; assess simply whether the results of the two tests (index and reference) agree or disagree (i.e., treat them as two alternative measurement methods). Instead of calculating sensitivity and specificity one would calculate statistics on test concordance, as mentioned later.
-
3.
Work within the classical paradigm, and calculate “naïve estimates” of the index test’s sensitivity and specificity from each study, but qualify study findings.
-
4.
Adjust the “naïve” estimates of sensitivity and specificity of the index test to account for the imperfect reference standard.
Our subjective assessment is that, when possible, the first option is preferred, as it recasts the problem into one that is inherently clinically meaningful. The second option may be less clinically meaningful, but is a defensible alternative to treating an inadequate reference standard as if it were effectively perfect. The third option is potentially subject to substantial bias, which is especially difficult to interpret when the results of the test under review and the “reference standard” are not conditionally independent (i.e., an error in one is more or less likely when there is an error in the other). The fourth option would be ideal if the adjustment methods were successful (i.e., if they eliminated bias in the estimates of sensitivity and specificity in the face of an imperfect reference standard). However, the available techniques require information that is typically not included in the reviewed studies, as well as advanced statistical modeling.
-
1.
Assess the index test’s ability to predict patient-relevant outcomes instead of test accuracy.
This option is not universally applicable. Instead of assessing the diagnostic or screening performance of the test, it quantifies the impact on (usually long-term) clinical outcomes of patient management strategies that include testing. When it is possible and desirable to recast the evaluation question as an assessment of a test’s ability to predict health outcomes, there are specific methods to consider when performing the assessment. For a more detailed discussion, the reader is referred to Paper 11 in this supplement of the Journal.7
-
2.
Assess the concordance of different tests instead of test accuracy
Here, the index and reference tests are treated as two alternative measurement methods. One explores how well one test agrees with the other test(s), and perhaps if one test can be used in the place of the other. Assessing concordance may be the only meaningful option if none of the compared tests is an obvious choice for a reference standard (e.g., when both tests are alternative methodologies to measure the same quantity).
In the case of categorical test results, one can summarize the extent of agreement between two tests using Cohen’s κ statistic (a measure of categorical agreement that takes into account the probability that some agreement will occur by chance). A meta-analysis of κ statistics may also be considered to supplement a systematic review8; because this is not common practice in the medical literature, such a meta-analysis should be explained and interpreted in some detail.
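For two dichotomous tests, Cohen’s κ can be computed directly from the cross-tabulated counts. A minimal sketch, using made-up counts for illustration:

```python
# Cohen's kappa for agreement between two dichotomous tests.
# The 2x2 counts are hypothetical, for illustration only.
def cohens_kappa(a, b, c, d):
    """a: both positive; b: test1+/test2-; c: test1-/test2+; d: both negative."""
    n = a + b + c + d
    p_obs = (a + d) / n  # observed agreement
    # chance agreement from the marginal totals of each test
    p_exp = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    return (p_obs - p_exp) / (1 - p_exp)

print(round(cohens_kappa(20, 5, 10, 65), 3))  # 0.625
```

Raw agreement here is 85 %, but κ is only 0.625 because much of that agreement is expected by chance.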
In the case of continuous test results, one is practically limited by the data available. If individual data points are available or extractable (e.g., from appendix tables or by digitizing plots), one can directly compare measurements with one test versus measurements with the other. One way is to perform an appropriate regression to obtain an equation for translating the measurements of one test to those of the other. Because both measurements have random noise, an ordinary least squares regression is not appropriate; it treats the “predictor” as fixed and error-free, and thus underestimates the slope of the relationship between the two tests. Instead, one should use a major axis or similar regression,9–12 or more complex regressions that account for measurement error; consulting a statistician is probably wise. An alternative and well-known approach is to perform difference versus average analyses (Bland–Altman-type analyses).13–15 A qualitative synthesis of information from Bland–Altman plots can be quite informative (see example).16 As of this writing, the authors have not encountered any methods for incorporating difference versus average information from multiple studies.
If individual data points are not available, one has to summarize study-level information on the agreement of individual measurements. Importantly, care is needed when selecting which information to abstract. Summarizing results from major axis regressions or Bland–Altman analyses is probably informative. However, other metrics are not necessarily as informative. For example, Pearson’s correlation coefficient, while often used to “compare” measurements with two alternative methods, is not a particularly good metric, for two reasons. First, it does not inform on the slope of the line describing the relationship between the two measurements; it informs on the degree of linearity of the relationship. Second, its value can be high (e.g., >0.90) even when the differences between the two measurements are clinically important. Thus, one should be circumspect in using and interpreting a high Pearson’s correlation coefficient for measurement comparisons.
-
3.
Qualify the interpretation of “naïve” estimates of the index test’s performance
This option is straightforward. One could obtain “naïve” estimates of index test performance and make qualitative judgments on the direction of the bias of these “naïve” estimates.
Tests with Independent Results within the Strata of the Disease
We have seen already in Table 3 that, when the results of the index and reference test are independent among those with and without the disease (conditional independence), the “naïve” sensitivity and specificity of the index test is biased down. The “more imperfect” the reference standard, the greater the difference between the “naïve” estimates and true test performance for the index test (Fig. 2).
Tests with Correlated Results within the Strata of the Disease
When the two tests are correlated conditional on disease status, the “naïve” estimates of sensitivity and specificity can be overestimates or underestimates, and the formulas in Table 3 do not hold. They can be overestimates when the tests tend to agree more than expected by chance. They can be underestimates when the correlation is relatively small, or the tests disagree more than expected by chance.
A clinically relevant example is the use of prostate-specific antigen (PSA) to detect prostate cancer. PSA levels have been used to detect the presence of prostate cancer, and over the years, a number of different PSA detection methods have been developed. However, PSA levels are not elevated in as many as 15 % of individuals with prostate cancer, making PSA testing prone to misclassification error.17 One explanation for these misclassifications (false-negative results) is that obesity can reduce serum PSA levels. The cause of misclassification (obesity) will likely affect all PSA detection methods—patients who do not have elevated PSA by a new detection method are also likely to not have elevated PSA by the older test. This “conditional dependence” will likely result in an overestimation of the diagnostic accuracy of the newer (index) test. In contrast, if the newer PSA detection method was compared to a non-PSA based reference standard that would not be prone to error due to obesity, such as prostate biopsy, conditional dependence would not be expected and estimates of diagnostic accuracy of the newer PSA method would likely be underestimated if misclassification occurs.
Because of the above, researchers should not assume conditional independence of test results without justification, particularly when the tests are based upon a common mechanism (e.g., both tests are based upon a particular chemical reaction, so that something which interferes with the reaction for one of the tests will likely interfere with the other test as well).18
-
4.
Adjust or correct the “naïve” estimates of sensitivity and specificity
Finally, one can mathematically adjust or correct the “naïve” estimates of sensitivity and specificity of the index test to account for the imperfect reference standard. The 2 × 2 cross-tabulation of test results is not sufficient to estimate the true sensitivities and specificities of the two tests, the prevalence of the conditions of interest, and correlations between sensitivities and specificities among those with and without the condition of interest. Therefore, additional information is needed. Several options have been explored in the literature. The following is by no means a comprehensive description; it is just an outline of several of the numerous approaches that have been proposed.
The problem is much easier if one can assume conditional independence for the results of the two tests, and further, that some of the parameters are known from prior knowledge. For example, one could assume that the sensitivity and specificity of the reference standard to detect true disease status is known from external sources, such as other studies,19 or that the specificities for both tests are known (from prior studies) but the sensitivities are unknown.20 In the same vein one can encode knowledge from external sources with prior distributions instead of fixed values, using Bayesian inference.21–24 Using a whole distribution of values rather than a single fixed value is less restrictive and probably less arbitrary. The resulting posterior distribution provides information on the specificities and sensitivities of both the index test and the reference standard, and of the prevalence of people with disease in each study.
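As an illustration of the first simplification, when conditional independence holds and the reference standard’s sensitivity and specificity are treated as known, the index test’s accuracy can be recovered in closed form by inverting the cell probabilities of Table 3. The sketch below uses hypothetical observed proportions generated from known true values, so the adjustment can be checked against them:

```python
# Adjust naive index-test accuracy when the reference standard's
# sensitivity/specificity are assumed known and the two tests are
# conditionally independent. All inputs are hypothetical illustrations.
def adjusted_accuracy(p_pos_pos, p_pos_neg, p_ref_pos, se_ref, sp_ref):
    """p_pos_pos: P(index+, ref+); p_pos_neg: P(index+, ref-);
    p_ref_pos: P(ref+). Returns (prevalence, Se_index, Sp_index)."""
    youden_ref = se_ref + sp_ref - 1
    # corrected prevalence from the apparent (reference-positive) rate
    prev = (p_ref_pos + sp_ref - 1) / youden_ref
    # solve the two linear equations for prev*Se_idx and (1-prev)*(1-Sp_idx)
    x = (p_pos_pos * sp_ref - p_pos_neg * (1 - sp_ref)) / youden_ref
    y = (p_pos_neg * se_ref - p_pos_pos * (1 - se_ref)) / youden_ref
    return prev, x / prev, 1 - y / (1 - prev)

# Observed proportions generated from prev=0.3, Se_idx=0.9, Sp_idx=0.8,
# Se_ref=0.95, Sp_ref=0.9 -- the adjustment recovers the true values.
prev, se_idx, sp_idx = adjusted_accuracy(0.2705, 0.1395, 0.355, 0.95, 0.9)
print(round(prev, 3), round(se_idx, 3), round(sp_idx, 3))  # 0.3 0.9 0.8
```

In practice the externally supplied reference-test accuracies are themselves uncertain, which is one motivation for the Bayesian approaches with prior distributions mentioned above.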
When conditional independence cannot be assumed, the conditional correlations have to be estimated as well. Many alternative parameterizations for the problem have been proposed.18,25–31 It is beyond the scope of the paper to describe them. Again, it is advisable to seek expert statistical help when considering such quantitative analyses, as modeling assumptions can have unanticipated implications32 and model misspecification can result in biased estimates.33
Illustration
As an illustration, we use a systematic review on the diagnosis of obstructive sleep apnea (OSA) in the home setting.16 Briefly, OSA is characterized by sleep disturbances secondary to upper airway obstruction. It affects an estimated 2 to 4 % of middle-aged adults, and has been associated with daytime somnolence, cardiovascular morbidity, diabetes and other metabolic abnormalities, and an increased likelihood of accidents and other adverse outcomes. Treatment (e.g., with continuous positive airway pressure) reduces symptoms and, hopefully, the long-term risk for cardiovascular and other events. There is no “perfect” reference standard for OSA. The diagnosis of OSA is typically established based on suggestive signs (e.g., snoring, thick neck) and symptoms (e.g., somnolence), in conjunction with an objective assessment of breathing patterns during sleep. The latter is done by means of facility-based polysomnography, a comprehensive neurophysiologic study of sleep in the lab setting. Most commonly, polysomnography quantifies one’s apnea-hypopnea index (AHI) (i.e., how many episodes of apnea [no airflow] or hypopnea [reduced airflow] a person experiences during sleep). A large AHI is suggestive of OSA. At the same time, portable monitors can be used to measure AHI instead of facility-based polysomnography.
Identifying (Defining) the Reference Standard
One consideration is what reference standard is most common, or otherwise “acceptable”, for the main analysis. In all studies included in the systematic review, patients were enrolled only if they had suggestive symptoms and signs (although it is likely that these were differentially ascertained across studies). Therefore, in these studies, the definition of “sleep apnea” is practically equivalent to whether people have a “high enough” AHI.
Most studies and some guidelines define AHI ≥ 15 events per hour of sleep as suggestive of the disease, and this is the cut-off selected for the main analyses. In addition, identified studies used a wide range of cut-offs in the reference method to define sleep apnea (including 5, 10, 15, 20, 30, and 40 events per hour of sleep). As a sensitivity analysis, the reviewers decided to summarize studies also according to the 10 and the 20 events per hour of sleep cut-offs; the other cut-offs were excluded because data were sparse. It is worth noting that, in this case, the exploration of the alternative cut-offs did not affect the results or conclusions of the systematic review, but did require substantial time and effort.
Deciding How to Summarize the Findings of Individual Studies and How to Present Findings
The reviewers calculated “naïve” estimates of sensitivity and specificity of portable monitors, and qualified their interpretation (option 3). They also performed complementary analyses outside the classical paradigm for evaluating test performance to describe the concordance of measurements with portable monitors (“index” test) and facility-based polysomnography (“reference” test; this is option 2).
Qualitative Analyses of “Naïve” Sensitivity and Specificity Estimates
The reviewers depicted graphs of the “naïve” estimates of sensitivity and specificity in the ROC space (see Fig. 3). These graphs suggest a high “sensitivity” and “specificity” of portable monitors to diagnose AHI ≥ 15 events per hour with facility-based polysomnography. However, it is very difficult to interpret these high values. First, there is considerable night-to-night variability in the measured AHI, as well as substantial between-rater and between-lab variability. Second, it is not easy to deduce whether the “naïve” estimates of “sensitivity” and “specificity” are underestimates or overestimates compared to the unknown “true” sensitivity and specificity to identify “sleep apnea.”
The systematic reviewers suggested that a better answer would be obtained by studies that perform a clinical validation of portable monitors (i.e., their ability to predict patients’ history, risk propensity, or clinical profile—this would be option 1) and identified this as a gap in the pertinent literature.
Qualitative Assessment of the Concordance Between Measurement Methods
The systematic reviewers decided to summarize Bland–Altman type analyses to obtain information on whether facility-based polysomnography and portable monitors agree well enough to be used interchangeably. For studies that did not report Bland–Altman plots, the systematic reviewers performed these analyses using patient-level data from each study, extracted by digitizing plots. An example is shown in Figure 4. The graph plots the differences between the two measurements against their average (which is the best estimate of the true unobserved value). An important piece of information from such analyses is the range of values defined by the 95 percent limits of agreement (i.e., the region in which 95 % of the differences are expected to fall). When the 95 % limits of agreement are very broad, the agreement is suboptimal (Fig. 4).
Figure 5 summarizes such plots across several studies. For each study, it shows the mean difference in the two measurements (mean bias) and the 95 % limits of agreement. The qualitative conclusion is that the 95 % limits of agreement are very wide in most studies, suggesting great variability in the measurements with the two methods.
Thus, AHI measurements with the two methods generally agree on who has 15 or more events per hour of sleep (which is a low AHI). They disagree on the exact measurement among people who have larger measurements on average: one method may calculate 20 and the other 50 events per hour of sleep for the same person. Consequently, the two methods are expected to disagree on who crosses cut-offs of 20, 30, or 40 events per hour.