Background

Medical tests are used to inform screening, diagnosis and prognosis in medicine. Meta-analysis methods are increasingly used to synthesise the evidence about a test’s accuracy from multiple studies, to produce summary estimates of sensitivity and specificity [1–4]. When the test is measured on a continuous scale, many studies report test performance at multiple thresholds, each relating to a different choice of cut-point above which test results are classed as ‘positive’ and below which they are classed as ‘negative’. Unfortunately, most primary studies do not report the same set of thresholds. For example, in an evaluation of the spot protein:creatinine ratio (PCR) for detecting significant proteinuria in pregnancy, Morris et al. [5] extracted tables for 23 different thresholds across 13 studies; eight of the thresholds were considered by just one study, but the other 15 thresholds were considered in two or more studies (Table 1), with a maximum of six studies for any threshold. In this situation, meta-analysts generally either utilise the results for just one threshold per study or utilise results for all reported thresholds but perform a separate meta-analysis for each threshold independently [6]. However, an approach that meta-analyses each threshold independently omits any studies that do not report the threshold of interest and thus ignores the information those studies provide at other thresholds.

Table 1 PCR results for each threshold in each of the 13 studies of Morris et al. [5]

In this article, we propose an exploratory method (a sensitivity analysis) to help researchers examine the potential impact of missing thresholds on their meta-analysis conclusions about a test’s accuracy. The method first imputes results in studies where missing thresholds are bounded between a pair of known thresholds; missing results are also bounded because as the threshold value increases, sensitivity must decrease and specificity must increase. The imputed results are then added to the meta-analysis, and this allows researchers to evaluate whether their original conclusions are robust. This is especially important when thresholds are prone to selective reporting bias; that is, they are less likely to be reported when they give lower values of sensitivity and/or specificity. In this situation, meta-analysis may otherwise produce summary sensitivity and specificity results that are too high (i.e. biased).

The article is structured as follows. In the “Motivating example” section, we describe the motivating PCR dataset in detail. In the “Methods” section, we describe our imputation method, explain its assumptions and perform an empirical evaluation. The “Results” section applies it to the PCR data, and the “Discussion” section concludes by considering the strengths and limitations of the method and further research.

Motivating example: identification of significant proteinuria in patients with suspected pre-eclampsia

Pre-eclampsia is a major cause of maternal and perinatal morbidity and mortality and occurs in 2%–8% of all pregnancies [7–10]. The diagnosis of pre-eclampsia is determined by the presence of elevated blood pressure combined with significant proteinuria (≥0.3 g per 24 h) after the 20th week of gestation in a previously normotensive, non-proteinuric patient [11]. The gold-standard method for detecting significant proteinuria is the 24-h urine collection, but this is cumbersome, time-consuming and inconvenient to patients as well as hospital staff. There is therefore a need for a rapid and accurate diagnostic test to identify significant proteinuria and allow more timely decision-making.

The spot PCR has been shown to be strongly correlated with 24-h protein excretion and thus is a potential diagnostic test for significant proteinuria. Morris et al. [5] performed a systematic review and meta-analysis to assess the diagnostic accuracy of PCR for the detection of significant proteinuria in patients with suspected pre-eclampsia. Thirteen relevant studies were identified, and in each study, the reference standard was proteinuria greater than or equal to 300 mg in urine over 24 h. Across the 13 studies, 23 different threshold values were considered for PCR, ranging from 0.13 to 0.50. Five studies provided diagnostic accuracy results (i.e. a two-by-two table showing the number of true positives, false positives, false negatives and true negatives) for just one threshold, but the other eight studies reported results for each of multiple thresholds, up to a maximum of nine thresholds (Yamasmit study). Eight of the 23 thresholds were considered by just one study, but the other 15 thresholds were considered in two or more studies, up to a maximum of six studies (for threshold 0.20). The studies and thresholds are summarised in Table 1.

Meta-analysis is important here to summarise the diagnostic accuracy of PCR at each threshold from all the published evidence, to help ascertain whether PCR is a useful diagnostic test and, if so, which threshold is the most appropriate to use in clinical practice. However, this is non-trivial given the number of thresholds available, the variation in how many studies report each threshold and the likely similarity between neighbouring threshold results. The PCR data is thus an ideal dataset to motivate and apply the statistical methods developed during the remainder of the paper.

Methods

We now propose our exploratory method for examining the impact of missing thresholds in meta-analysis of test accuracy studies.

Exploratory method to examine the potential impact of missing thresholds

Let there be i = 1 to m studies that measure a continuous test result on n1i diseased patients and n0i non-diseased patients, whose true disease status is provided by a reference standard. In each study, at a particular threshold value, x, each patient’s measured test value is classed as either ‘positive’ (≥ x) or ‘negative’ (<x). Then summarising test results over all patients produces aggregate data in the form of r11ix, the number of truly diseased patients in study i with a positive test result at threshold x, and r00ix, the number of non-diseased patients in study i with a negative test result. The observed sensitivity at threshold x in each study is thus simply r11ix/n1i and the observed specificity is r00ix/n0i.
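As a concrete illustration of this notation, the observed sensitivity and specificity follow directly from the aggregate counts. The Python sketch below is illustrative only (it is not the paper's STATA code); the counts are those of the Al Ragib study at threshold 0.13, discussed in the next paragraph.

```python
# Illustrative sketch (not the paper's STATA code): observed sensitivity and
# specificity at one threshold, from the aggregate counts r11ix and r00ix.

def observed_accuracy(r11: int, n1: int, r00: int, n0: int) -> tuple[float, float]:
    """Return (sensitivity, specificity) = (r11/n1, r00/n0)."""
    return r11 / n1, r00 / n0

# Al Ragib study at threshold 0.13: 35 of 39 diseased patients test positive,
# and 95 of 146 non-diseased patients test negative.
sens, spec = observed_accuracy(35, 39, 95, 146)  # ≈ (0.897, 0.651)
```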

When results for a particular threshold are missing but other thresholds above and below are available, the missing threshold has sensitivity and specificity results constrained between these values. For example, consider the Al Ragib study (Table 1), which has threshold values of 0.13 and 0.18 available, but not 0.14 to 0.17. The number of true positives for the missing thresholds must lie between the values at the observed thresholds, 33 and 35. Similarly, the missing false positives must lie between 42 and 51, the missing false negatives between 4 and 6 and the missing true negatives between 95 and 104.
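The bounding argument above can be sketched in a few lines of Python. The function name and dictionary layout are ours, chosen for illustration; they are not from the paper's STATA code.

```python
# Sketch of the bounding logic: given 2x2 counts at a lower and a higher
# observed threshold, any missing threshold in between has counts bounded by
# the two, because sensitivity (and hence TP) can only fall, and specificity
# (and hence TN) can only rise, as the threshold value increases.

def bounds_for_missing(lower: dict, upper: dict) -> dict:
    """Return (min, max) bounds for each 2x2 cell at a missing threshold
    lying between the two observed thresholds."""
    return {
        "TP": (upper["TP"], lower["TP"]),  # TP falls as the threshold rises
        "FN": (lower["FN"], upper["FN"]),  # FN rises as the threshold rises
        "FP": (upper["FP"], lower["FP"]),  # FP falls as the threshold rises
        "TN": (lower["TN"], upper["TN"]),  # TN rises as the threshold rises
    }

# Al Ragib study: thresholds 0.13 (lower) and 0.18 (upper) are observed.
al_ragib_013 = {"TP": 35, "FN": 4, "FP": 51, "TN": 95}
al_ragib_018 = {"TP": 33, "FN": 6, "FP": 42, "TN": 104}
b = bounds_for_missing(al_ragib_013, al_ragib_018)
# e.g. b["TP"] == (33, 35): true positives at 0.14-0.17 lie between 33 and 35
```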

Rather than ignoring missing thresholds that are bounded between known thresholds, our exploratory method imputes the missing results under particular assumptions, so that they can be included in the meta-analysis. The aim is to ascertain whether the original meta-analysis conclusions (obtained without imputed data) are robust to the inclusion of imputed data. For example, does the summary test accuracy at each threshold remain similar, and is the choice of best threshold the same? The exploratory method is a two-step approach, as now described.

Step 1: imputation of missing bounded thresholds in each study

In each study separately, for each threshold that is missing but bounded between known thresholds, the missing results are imputed by assuming each 1-unit increase in threshold value corresponds to a constant reduction in logit-sensitivity (y1ix) and a constant increase in logit-specificity (y0ix). Thus, imputation is on a straight line between pairs of observed points in logit receiver operating characteristic (ROC) space. This piece-wise linear approach is illustrated graphically in Figure 1. The key assumption is therefore a constant change in logit values for each 1-unit change in threshold value between each pair of observed threshold results. The linear slope can differ between each pair of thresholds, so no single trend is assumed across all thresholds, and the fitted lines are forced to pass through the observed points. Linear relationships on the logit scale are often used in diagnostic test analyses when considering the ROC curve, especially in meta-analysis [12], and this is a straightforward approach for an exploratory analysis.

Figure 1
figure 1

Graphical illustration of the imputation approach for the Al Ragib study.

Once the imputed logit values are obtained, one can back-transform to compute the corresponding imputed true and false positives and negatives. For example, let TP1 be the true positive number at a threshold value of 0.13 and TP2 the true positive number at threshold 0.18. In the Al Ragib study, there are five threshold units from 0.13 to 0.18. The imputed logit-sensitivity at threshold 0.14 is y1i(0.13) + (y1i(0.18) − y1i(0.13))/5, at threshold 0.15 it is y1i(0.13) + 2(y1i(0.18) − y1i(0.13))/5, and so on (Table 2, Figure 1). It is then straightforward to calculate the number of true positives, false positives, true negatives and false negatives needed to produce these values. For example, at threshold 0.15 the imputed logit-sensitivity is 1.983, so the imputed sensitivity is 0.879; given there are 39 patients truly with high proteinuria, the imputed numbers of true positives and false negatives are 34.28 and 4.72, respectively (Table 2).

Table 2 Actual and imputed results for the Al Ragib study between thresholds 0.13 and 0.18
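The worked example above can be reproduced with a short Python sketch of the step 1 interpolation. Function and variable names are illustrative (the paper's implementation is a STATA ‘do’ file), but the arithmetic matches the Al Ragib calculation in the text.

```python
import math

# Sketch of step 1: impute a missing true-positive count by assuming (as in
# the text) a straight line in logit-sensitivity between two observed
# thresholds, then back-transforming to a count.

def logit(p: float) -> float:
    return math.log(p / (1 - p))

def expit(y: float) -> float:
    return 1 / (1 + math.exp(-y))

def impute_tp(tp_lo, tp_hi, n1, t_lo, t_hi, t_missing):
    """Impute the true-positive count at a missing threshold lying between
    two observed thresholds t_lo < t_missing < t_hi."""
    y_lo, y_hi = logit(tp_lo / n1), logit(tp_hi / n1)
    frac = (t_missing - t_lo) / (t_hi - t_lo)   # position between thresholds
    y = y_lo + frac * (y_hi - y_lo)             # interpolated logit-sensitivity
    return n1 * expit(y)                        # back-transform to a TP count

# Al Ragib study: TP = 35 at threshold 0.13, TP = 33 at 0.18, 39 diseased.
tp_015 = impute_tp(35, 33, 39, 0.13, 0.18, 0.15)
# tp_015 ≈ 34.28, i.e. imputed sensitivity ≈ 0.879 (logit ≈ 1.983), as in Table 2
```

The false-negative count follows as n1 minus the imputed true positives (39 − 34.28 = 4.72), and the specificity side is handled identically using logit-specificity.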

Note that we neither impute beyond the highest available threshold nor below the lowest available threshold in each study. Further assumptions would be necessary to do this, and here we only work within the limits of the observed data. Thus, for studies reporting only one threshold, no imputation was used. Similarly, we do not impute for any new threshold values that were not considered by any of the available studies.

A STATA ‘do’ file implementing the imputation method is available in Additional file 1, and we aim to release an associated STATA module in the near future. It produces the original and imputed values within a few seconds, for any number of studies and any number of thresholds.

Step 2: meta-analysis at each threshold separately using actual and imputed data

The imputation in step 1 borrows strength from available thresholds, so that a larger set of threshold data is available from each study for meta-analysis. For ease of language, let us order the thresholds of interest and refer to the ordered value as t (e.g. t = 1 to 23 in the PCR example, Table 1). Each threshold t now has (i) one or more studies with observed results and potentially (ii) some studies with imputed results. A meta-analysis of each threshold separately can now be performed, using the observed and imputed results. A convenient model is the bivariate meta-analysis of Chu and Cole [2]. This approach is recommended by the Cochrane Screening and Diagnostic Test Methods Group and is commonly used in diagnostic accuracy meta-analyses. It utilises the exact binomial within-study distribution, thereby avoiding the need for any continuity corrections, and accounts for any between-study correlation in sensitivity and specificity, as follows:

$$
\begin{aligned}
r_{11it} &\sim \mathrm{Binomial}(n_{1i},\, p_{1it}), \qquad \mathrm{logit}(p_{1it}) = \mu_{1it} \\
r_{00it} &\sim \mathrm{Binomial}(n_{0i},\, p_{0it}), \qquad \mathrm{logit}(p_{0it}) = \mu_{0it} \\
\begin{pmatrix} \mu_{1it} \\ \mu_{0it} \end{pmatrix} &\sim N\!\left( \begin{pmatrix} \beta_{1t} \\ \beta_{0t} \end{pmatrix}, \; \Omega_t \right), \qquad \Omega_t = \begin{pmatrix} \tau_{1t}^2 & \rho_{10t}\tau_{1t}\tau_{0t} \\ \rho_{10t}\tau_{1t}\tau_{0t} & \tau_{0t}^2 \end{pmatrix}
\end{aligned} \tag{1}
$$

β1t and β0t give the average logit-sensitivity and average logit-specificity at threshold t, respectively, and these can be transformed to give the summary sensitivity and summary specificity from the meta-analysis for each threshold. The between-study covariance matrix is given by Ωt, containing the between-study variances (τ1t² and τ0t²) and the between-study correlation between logit-sensitivity and logit-specificity (ρ10t); if the latter is zero, the model reduces to a separate univariate analysis for each of sensitivity and specificity. Indeed, ρ10t will often be poorly estimated at +1 or -1 [13], and so it may be sensible to adopt two separate univariate models here [14], as follows:

$$
\begin{aligned}
r_{11it} &\sim \mathrm{Binomial}(n_{1i},\, p_{1it}), \qquad \mathrm{logit}(p_{1it}) = \mu_{1it}, \qquad \mu_{1it} \sim N(\beta_{1t},\, \tau_{1t}^2) \\
r_{00it} &\sim \mathrm{Binomial}(n_{0i},\, p_{0it}), \qquad \mathrm{logit}(p_{0it}) = \mu_{0it}, \qquad \mu_{0it} \sim N(\beta_{0t},\, \tau_{0t}^2)
\end{aligned} \tag{2}
$$

Models (1) and (2) can be estimated using adaptive Gaussian quadrature [15], for example using PROC NLMIXED in SAS [16] or the xtmelogit command in STATA [17]. The number of quadrature points can be specified, with estimation accuracy increasing as the number of points increases, at the expense of increased computational time. We generally chose five quadrature points for our analyses, as this gave estimates very close to those obtained with >10 points but in a faster time. Successful convergence of the optimisation procedure was assumed when successive iteration estimates differed by <10^-7, resulting in parameter estimates and their approximate standard errors based on the second derivative matrix of the likelihood function.

Empirical evaluation of the imputation method

To empirically evaluate the imputation method, we utilised individual participant data (IPD) from six studies examining the ability of parathyroid hormone (PTH) to correctly classify which patients will become hypocalcemic within 24 h after a thyroidectomy [18]. The percentage decrease in PTH (from pre-surgery to 6-h post-surgery) was used as the test, and thresholds of 40%, 50%, 60%, 65%, 70%, 80% and 90% were examined. As IPD were available, the results for all thresholds were available for all studies. Thus, for each threshold separately, we could fit model (1) using the complete set of data from the six studies. However, model (1) often poorly estimated the between-study correlations at +1 or -1 and gave summary test accuracy results very similar to model (2). Thus, we focus here only on model (2) results; these provided our ‘complete data’ meta-analysis results for when all thresholds are truly available from all studies.

Generation of missing thresholds and imputation

To replicate missing data mechanisms, we considered two scenarios:

Scenario (i): thresholds missing at random We took the complete set of threshold results for each study and randomly assigned some to be missing, with each having a 0.5 probability of being omitted. This provided a new meta-analysis dataset of up to six studies with missing threshold results. We repeated this process until 1,000 such meta-analysis datasets had been produced. In each dataset, we applied our imputation approach, and then for each threshold, we fitted model (2) to (i) each generated dataset without including the imputed results and (ii) each generated dataset with the addition of the imputed results. We then compared the average meta-analysis estimates and standard errors from the 1,000 analyses of (i) and (ii) with the true meta-analysis results when the complete data were available (Table 3).

Table 3 Empirical evaluation results—scenario (i), thresholds missing at random

Scenario (ii): thresholds selectively missing We took the complete set of threshold results for each study and assigned some to be missing through a selective (not missing at random) mechanism, based on the observed sensitivity estimate. All thresholds with observed sensitivity ≥ 90% were always included; those with sensitivity <90% had a 0.5 probability of being omitted. This reflects a realistic situation where researchers are more likely to report those thresholds where sensitivity is observed to be high. We repeated this process until 1,000 such meta-analysis datasets had been produced. In each dataset, we applied our imputation approach, and then for each threshold, we fitted model (2) to (i) each generated dataset without including the imputed results and (ii) each generated dataset with the addition of the imputed results. We then compared the average meta-analysis estimates and standard errors from the 1,000 analyses of (i) and (ii) with the true meta-analysis results when the complete data were available (Table 4).
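The two missingness mechanisms can be sketched as follows. This is illustrative Python only; the threshold-sensitivity values below are made up for the example and are not the PTH data.

```python
import random

def make_missing(results, selective=False, rng=None):
    """Return the thresholds retained under scenario (i) (missing at random)
    or scenario (ii) (selective: observed sensitivity >= 0.90 always kept,
    lower sensitivities omitted with probability 0.5)."""
    rng = rng or random.Random(1)
    kept = {}
    for threshold, sens in results.items():
        if selective and sens >= 0.90:
            kept[threshold] = sens       # high sensitivity is always reported
        elif rng.random() < 0.5:         # otherwise kept with probability 0.5
            kept[threshold] = sens
    return kept

# One hypothetical study: threshold (%) -> observed sensitivity.
study = {40: 0.97, 50: 0.93, 60: 0.88, 65: 0.85, 70: 0.80, 80: 0.73, 90: 0.60}
mar = make_missing(study)                   # scenario (i)
mnar = make_missing(study, selective=True)  # scenario (ii)
```

Repeating this generation 1,000 times, then fitting model (2) with and without the imputed results, reproduces the structure of the evaluation described above.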

Table 4 Empirical evaluation results—scenario (ii), thresholds selectively missing

Results

Results of empirical evaluation

In the empirical evaluation, there is no imputation for the lowest and highest thresholds of 40% and 90%, and so the meta-analysis results at these thresholds are identical regardless of whether imputed data are included. For the other thresholds, however, there is always potential for imputation.

For thresholds between 40% and 90% in scenario (i), where thresholds were missing at random, the mean and median estimates tend to be slightly closer to the complete-data results when using the imputed data (Table 3). For example, at threshold 65%, the true sensitivity estimate from complete data is 0.85; the mean/median without imputed data is 0.87/0.89, but with the imputed data it is pulled back to 0.85/0.85. Further, the meta-analyses including the imputed data give substantially smaller standard errors than those excluding imputed data. For example, for threshold 65%, the standard error of the summary logit-sensitivity is 0.69 when ignoring imputed data and 0.49 when including it, a gain in precision of almost 30%. This gain reflects the additional information provided by the imputed results. As estimates are close to the true estimates and standard errors are considerably reduced, the mean-square error of estimates is therefore improved. The imputation approach also gives standard errors that are closer to (though still not smaller than) those from the true complete-data meta-analysis.

For scenario (ii), where thresholds are selectively missing based on the value of the observed sensitivity, the summary meta-analysis results without the imputed data are again slightly larger than the true estimated values, especially for sensitivity. The meta-analysis results using the imputed data generally reduce this bias and give more conservative estimates. For example, for the 50% threshold, the true estimated value for sensitivity is 0.87; the median meta-analysis estimate without the imputed data is 0.90, but the median estimate including imputed data is pulled back down to 0.87. Occasionally, the imputation method over-adjusted, so that the estimate was pulled down too far. For example, for the 80% threshold, the true estimated value for sensitivity is 0.73, but the median estimate without the imputed data is 0.75 and with the imputed data is 0.71. However, even here the absolute magnitude of bias is the same (0.02) with and without imputed data. For all thresholds between 40% and 90%, there is again a considerable reduction in the standard error of meta-analysis estimates following the use of imputed data.

In summary, the empirical evaluation shows that the imputation method performs well, with summary test accuracy estimates generally moved slightly closer to the true estimates based on complete data. This finding, combined with smaller standard errors and thus smaller mean-square error of estimates, suggests the imputation approach is useful as an exploratory analysis.

Application to the PCR example

Our imputation approach was applied to the PCR studies introduced in the “Motivating example” section, and missing threshold results could be imputed in 6 of the 13 studies. For 21 of the 23 different thresholds, the imputation approach increased the number of studies providing data for that threshold (Table 5), and in total, 50 additional threshold results were imputed, substantially increasing the information available for meta-analysis. For example, at a PCR threshold of 0.22, the imputation increased the available studies from 1 to 5, and at a threshold of 0.3, the available studies increased from 4 to 7.

Table 5 Summary meta-analysis results following application of model (2) with and without the imputed data included

Meta-analysis model (1) was applied to each threshold’s data separately, but the between-study correlations were often poorly estimated at -1 in these analyses, so we instead fitted model (2) (i.e. ρ10t was set to zero for all analyses, allowing a separate analysis of sensitivity and specificity at each threshold) [14]. The summary meta-analysis results when including or ignoring the imputed data are shown for each threshold in Table 5 and Figure 2. Importantly, the results when including the imputed data are often very different from those when ignoring it. In particular, the summary estimates of sensitivity and specificity are generally reduced when using the imputed data, as can be seen visually in the summary ROC space (Figure 2). For example, when imputed data were included, the summary specificity at a PCR threshold of 0.16 reduced from 0.80 to 0.66 and the summary sensitivity at a PCR threshold of 0.25 reduced from 0.95 to 0.85.

Figure 2
figure 2

Summary meta-analysis results presented in ROC space, comparing the summary meta-analysis results shown in Table 5, with and without inclusion of imputed thresholds. To help compare approaches, summary estimates for the same threshold are shown connected.

The points in ROC space tend to move down and to the right after including imputed data, revealing lower sensitivity and specificity than previously thought. Thus, it appears that the results when ignoring imputed data may be optimistic, potentially due to biased availability of thresholds when they give higher test accuracy results. In both analyses (assuming sensitivity and specificity are equally important), the best threshold appears to be between 0.25 and 0.30; however, test accuracy at these thresholds is lower after imputation.

The dramatic change in results for some thresholds suggests that individual participant data are needed to obtain a complete set of threshold results from each study and thereby remove the suspected reporting bias in primary studies. We also attempted to use the advanced statistical modelling framework of Hamza et al. [12] to reduce the impact of missing thresholds by jointly synthesising all thresholds in one multivariate model; however, this approach failed to converge, most likely due to the amount of missing data. The multiple thresholds model of Putter et al. [19] also required complete data for all thresholds, whilst the method of Dukic and Gatsonis [20] was not considered suitable, as it produces a summary ROC curve but does not give meta-analysis results for each threshold.

Discussion

Primary study authors often do not use the same thresholds when evaluating a medical test and will predominantly report those thresholds that produce the largest (optimal) sensitivity and specificity estimates [21]. This may lead to optimistic and misleading meta-analysis results based only on reported thresholds. We have proposed an exploratory method for examining the impact of missing threshold results in meta-analysis of test accuracy studies and shown its potential usefulness through an applied example and empirical evaluation. The imputation method is applicable when studies use the same (or similarly validated or standardised) methods of measuring a continuous test (e.g. blood pressure or a continuous biomarker, such as prostate-specific antigen). It is deliberately very simple, so that applied researchers can still implement standard meta-analysis methods and examine the potential impact of missing thresholds on meta-analysis conclusions. For example, our application to the PCR data showed how the imputation method revealed lower diagnostic test accuracy results than a standard meta-analysis of each threshold independently, but conclusions about the best choice of threshold appeared robust.

Other more sophisticated methods are also available to deal with multiple thresholds, but all have limitations. Hamza et al. [12] propose a multivariate random-effects meta-analysis approach and apply it when all studies report all of the thresholds of interest. It models the (linear) relationship between threshold value and test accuracy within each study but is prone to convergence problems (as we experienced for the PCR example), prompting Putter et al. [19] to propose an alternative survival model framework for meta-analysing the multiple thresholds. However, this also requires the multiple thresholds to be available in all studies. Others have also considered the multiple threshold issue [20, 22–27]. A well-known method by Dukic and Gatsonis [20] only produces a summary ROC curve, rather than summary results for each threshold of interest. We recently proposed a multivariate-normal approximation to the Hamza et al. approach [27], which produces both a summary ROC curve and summary results for each threshold and easily accommodates studies with missing thresholds. However, the multivariate-normal approximation to the exact multinomial likelihood is a potential limitation.

Our exploratory method is not a competitor to these more sophisticated methods. Rather, it is an exploratory tool aimed at researchers (usually non-statisticians) conducting systematic reviews of test accuracy studies. The method is practical and easy to implement without advanced statistical expertise and so can quickly flag whether researchers should be concerned about missing thresholds in their meta-analysis. This was demonstrated in the PCR example, where the method flagged major concerns that original conclusions were optimistic. In this situation, researchers should be stimulated to put resources toward undertaking the aforementioned advanced statistical methods or, ideally, obtaining individual participant data to calculate missing threshold results directly.

The key reason that we label our method as ‘exploratory’ is that it only considers single imputation. Single imputation of missing values usually causes standard errors of estimates to be too small, since it fails to account for the uncertainty in the imputed values themselves; multiple imputation would help address this [28]. In particular, data imputed between two thresholds close together (e.g. imputing at a threshold of 0.24 using available thresholds 0.23 and 0.25) should have less uncertainty than data imputed between two thresholds far apart (e.g. imputing at 0.24 using thresholds 0.13 and 0.50), but this is not currently accounted for in our approach. Further research may consider an extension to multiple imputation. Also, our imputation assumes a linear relationship between threshold value and logit-sensitivity and logit-specificity; although this relationship is commonly used in meta-analysis of test accuracy studies, it is of course an assumption.

Thus, our imputation method is a sensitivity analysis: it shows, under the assumptions made, how vulnerable the original meta-analysis conclusions are to the missing threshold results. The focus is therefore on how the method modifies the original summary meta-analysis estimates; less attention should be paid to the standard errors and confidence intervals it produces, as these may be artificially small and narrow. The method is thus similar in spirit to how others have evaluated the potential impact of (biased) missing data in meta-analysis of randomised trials, such as trim and fill [29] and adjustments based on funnel plot asymmetry [30]. For example, trim and fill imputes missing studies assuming asymmetry is caused by publication bias, and Peters et al. [31] conclude it ‘can help to reduce the bias in pooled estimates, even though the performance of this method is not ideal … we recommend use of the trim and fill method as a form of sensitivity analysis.’ Similarly, our method can help to reduce bias and mean-square error in pooled meta-analysis results.

Conclusion

We have proposed an exploratory analysis that allows researchers to examine the potential impact of missing thresholds on the conclusions of a test accuracy meta-analysis. Currently, most researchers ignore this issue, but our PCR example shows that this may be naive, as conclusions are susceptible to selective threshold reporting in primary studies. STATA code to fit the imputation approach is available in Additional file 1, and an associated STATA module will be released in the near future.