Introduction

Nineteen per cent of patients with colorectal cancer are diagnosed with distant metastasis at initial presentation, which (when undertreated) is associated with a 5-year survival rate of 7% [1]. Furthermore, local and distant recurrences occur in 30–50% of patients during follow-up after primary surgery [2]. Whereas in many patients metastatic disease cannot be cured, in carefully selected patients a resection of the metastases has been reported to result in 5-year survival rates up to 30–40% [3]. The therapeutic options for colorectal metastases—including surgery and chemotherapy—as well as the clinical outcome depend strongly on accurate evaluation and early identification of recurrent lesions.

Metastatic disease in colorectal cancer is most common in liver and lung, but can affect the whole body. Whole-body imaging is important in different clinical settings. First, at primary staging of colorectal cancer it is important to determine the local and distant spread of the tumour in order to establish the risk profile and the indicated treatment. Second, whole-body imaging can be used either as part of a surveillance programme after surgery for colorectal cancer or when a recurrence is suspected on the basis of clinical examinations. In patients with suspected recurrence, it is unclear which whole-body staging modality is most accurate for the detection of a recurrence. Currently, in this specific setting, computed tomography (CT) is used to detect recurrence, even though CT has a high false-positive rate for pulmonary lesions and a high false-negative rate for extrahepatic intra-abdominal lesions (e.g. para-aortic nodes) [4, 5]. Several studies have reported good results for whole-body staging with 18F-fluorodeoxyglucose (FDG) positron emission tomography (PET) and FDG PET/CT [6–8]. The experience with magnetic resonance imaging (MRI) is limited.

So far, there is no consensus on which is the most accurate whole-body imaging technique for colorectal cancer patients who have a suspicion of recurrence based on clinical findings or rise in carcinoembryonic antigen (CEA). Therefore, the objective of the present study is to perform a meta-analysis of published studies in order to determine which is the most accurate whole-body imaging modality for the detection of recurrent disease in patients with colorectal cancer who have suspected local and/or distant recurrent disease and to advise which modality is most suitable in clinical practice.

Materials and methods

A literature search was performed in PubMed/MEDLINE and Embase up to May 2010 using the following search terms: ‘colorectal neoplasm or carcinoma or cancer’, ‘whole-body imaging or staging’, ‘neoplasm staging’, ‘colorectal’, ‘metastasis’, ‘recurrence’, ‘positron emission tomography’ or ‘PET’, ‘magnetic resonance imaging’ or ‘MRI’, ‘computed tomography’ or ‘CT’ and ‘PET-CT’ or ‘PET/CT’. PET refers to FDG PET. No language restriction was used. Studies were included when they met the following criteria: (1) focus on metastasis and/or recurrence detection in patients with suspected recurrence in the follow-up for colorectal cancer, (2) study population included more than 20 patients with colorectal cancer, (3) results were given in a 2 × 2 contingency table or this table could otherwise be derived from the article and (4) reference standard combined histology with follow-up. Case reports, reviews, articles that evaluated local staging only or detection of liver metastases only and studies that evaluated whole-body imaging at primary staging or for patients with known hepatic metastases were excluded. Last, studies which evaluated response to therapy only were also excluded.

Two reviewers (IJGR and MM) independently searched the databases for eligible studies. The reviewers checked the titles and abstracts of the identified studies in order to select studies which potentially met the inclusion criteria. Thereafter they independently studied full text copies of the selected studies to make a decision as to which studies met the inclusion criteria. In case of disagreement, consensus was reached. Reference lists were checked to find additional eligible studies. Data which were extracted from the studies were: (1) number, gender and age of patients, (2) study objective, (3) type of reference standard, (4) unit of analysis (lesion or patient based analysis), (5) degree of blinding, (6) duration of follow-up and (7) prevalence of disease. Study quality was assessed with the QUADAS checklist for studies of diagnostic accuracy included in systematic reviews [9].

Statistical analysis

Where available, results from lesion based analyses were used for this meta-analysis, but some studies reported data only on a patient basis. Based on the results from the (derived) 2 × 2 contingency tables, pooled measures for diagnostic performance, such as sensitivity, specificity, diagnostic odds ratio (DOR) and area under the receiver-operating characteristic (ROC) curve (AUC), were calculated using random effects models. The pooled DOR for each imaging modality was used for the construction of summary ROC (SROC) curves. SROC curves account for the so-called threshold effect in diagnostic studies, which arises when studies use different cutoff points or thresholds to define a positive or negative test result. The DOR combines sensitivity and specificity into one measure for diagnostic performance. A DOR of 1 means that the test has no ability to discriminate. The higher the DOR, the better the ability of a test to discriminate between subjects with and without the disease of interest. To test differences in diagnostic performance between modalities for statistical significance, the relative DOR of one modality compared to another was calculated with its corresponding p value.
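As a minimal illustration of how a DOR follows from a single 2 × 2 contingency table, the Python sketch below computes the DOR with a 95% CI on the log scale. The function name, the 0.5 continuity correction for tables with a zero cell and the example counts are assumptions made for illustration only; they are not taken from the included studies or from the software used in this analysis.

```python
import math

def diagnostic_odds_ratio(tp, fp, fn, tn, correction=0.5):
    """Return (DOR, lower, upper) with a 95% CI from one 2 x 2 table.

    If any cell is zero, `correction` is added to every cell. This is an
    assumption of the sketch, not a choice documented in the study.
    """
    if 0 in (tp, fp, fn, tn):
        tp, fp, fn, tn = (x + correction for x in (tp, fp, fn, tn))
    # DOR = (TP/FN) / (FP/TN): odds of a positive test in diseased
    # subjects divided by the odds of a positive test in non-diseased ones.
    log_dor = math.log((tp * tn) / (fp * fn))
    # Standard error of the log odds ratio.
    se = math.sqrt(1 / tp + 1 / fp + 1 / fn + 1 / tn)
    return (
        math.exp(log_dor),
        math.exp(log_dor - 1.96 * se),
        math.exp(log_dor + 1.96 * se),
    )

# Hypothetical counts: sensitivity 90% and specificity 90%.
dor, lower, upper = diagnostic_odds_ratio(tp=45, fp=5, fn=5, tn=45)
print(f"DOR = {dor:.1f} (95% CI {lower:.1f} to {upper:.1f})")
```

A table in which the test is positive equally often in diseased and non-diseased subjects gives a DOR of 1, in line with the interpretation above.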

The Cochran Q test was used to test for heterogeneity between individual study results. Significance of this test indicates that differences between study results cannot solely be attributed to sampling variation. The p value for heterogeneity was considered significant when p < 0.10, because heterogeneity tests are known for their lack of statistical power. Differences in DORs between studies can also result from differences in design, case mix and analysis. To account for heterogeneity, the DOR and AUC for the imaging modalities under study were pooled within subgroups of studies. These subgroups were made according to the presence or absence of a specific study characteristic that can affect the estimate of the diagnostic performance of a modality.
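The heterogeneity test and random effects pooling described above can be sketched as follows. The sketch applies the Cochran Q statistic to log DORs with inverse-variance weights and pools them with the DerSimonian-Laird estimator. The choice of that estimator, the function names and the input values are assumptions for illustration; the text does not state which random effects estimator was used. At least two studies are required.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)              # Q(1, y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))   # Q(1/2, y)
    # Recurrence Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1)
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q

def cochran_q(effects, variances):
    """Return (Q, degrees of freedom, p value) for per-study effects."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    return q, df, chi2_sf(q, df)

def dersimonian_laird(effects, variances):
    """Return (pooled effect, standard error, tau squared)."""
    q, df, _ = cochran_q(effects, variances)
    w = [1.0 / v for v in variances]
    c = sum(w) - sum(x * x for x in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)             # between-study variance
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(ws * e for ws, e in zip(w_star, effects)) / sum(w_star)
    return pooled, math.sqrt(1.0 / sum(w_star)), tau2

# Hypothetical log DORs and their variances for four studies.
log_dors = [2.1, 3.4, 4.6, 2.9]
variances = [0.20, 0.35, 0.50, 0.25]
q, df, p = cochran_q(log_dors, variances)
pooled, se, tau2 = dersimonian_laird(log_dors, variances)
print(f"Q = {q:.2f}, df = {df}, p = {p:.3f}, heterogeneous: {p < 0.10}")
print(f"pooled DOR = {math.exp(pooled):.1f}, tau^2 = {tau2:.2f}")
```

The comparison `p < 0.10` mirrors the significance threshold stated above for the heterogeneity test.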

Pooled estimates of diagnostic performance and relative DORs were calculated with Meta-DiSc version 1.4 [10], a software programme which implements meta-regression using a generalization of a model that was proposed by Moses et al. [11].
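For readers unfamiliar with the SROC approach, the sketch below shows the unweighted linear regression form that is commonly attributed to Moses et al.: for each study, D = logit(TPR) − logit(FPR) is regressed on S = logit(TPR) + logit(FPR), and the fitted line is transformed back to ROC space. This is a simplified illustration under stated assumptions, not a reproduction of the Meta-DiSc implementation; the function names, the trapezoidal integration and the example values are hypothetical, and sensitivities and specificities must lie strictly between 0 and 1.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_moses(sens_spec_pairs):
    """Least-squares fit of D = a + b * S; returns (a, b)."""
    d_vals, s_vals = [], []
    for sens, spec in sens_spec_pairs:
        lt, lf = logit(sens), logit(1.0 - spec)
        d_vals.append(lt - lf)   # log DOR of the study
        s_vals.append(lt + lf)   # proxy for the positivity threshold
    n = len(d_vals)
    s_mean, d_mean = sum(s_vals) / n, sum(d_vals) / n
    sxx = sum((s - s_mean) ** 2 for s in s_vals)
    sxy = sum((s - s_mean) * (d - d_mean) for s, d in zip(s_vals, d_vals))
    b = sxy / sxx
    return d_mean - b * s_mean, b

def sroc_tpr(fpr, a, b):
    """Expected TPR on the fitted SROC curve at a given FPR (0 < FPR < 1)."""
    return expit(a / (1.0 - b) + (1.0 + b) / (1.0 - b) * logit(fpr))

def sroc_auc(a, b, steps=2000):
    """Trapezoidal area under the SROC curve; assumes |b| < 1."""
    xs = [i / steps for i in range(steps + 1)]
    ys = [0.0] + [sroc_tpr(x, a, b) for x in xs[1:-1]] + [1.0]
    return sum((ys[i] + ys[i + 1]) / 2.0 / steps for i in range(steps))

# Hypothetical (sensitivity, specificity) pairs for five studies.
studies = [(0.95, 0.80), (0.90, 0.85), (0.85, 0.92), (0.97, 0.70), (0.88, 0.90)]
a, b = fit_moses(studies)
print(f"a = {a:.2f}, b = {b:.2f}, AUC = {sroc_auc(a, b):.3f}")
```

When the slope b is close to zero the curve is symmetric and the DOR is constant along it, equal to exp(a).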

Results

The search retrieved 82 studies, of which 60 articles potentially met the inclusion criteria after selection based on titles and abstracts. Of these 60 articles, 46 were excluded [12–56], leaving 14 articles for inclusion [2, 6–8, 57–66]. The 46 articles were excluded for the following reasons. Twenty articles studied patients with primary colorectal cancer or patients with known hepatic metastases [13, 14, 16, 18, 20, 21, 23, 25, 29, 30, 33–36, 38, 40, 41, 44, 55, 56]. Nine studies were excluded because they included fewer than 20 patients with colorectal cancer [12, 15, 19, 22, 24, 26, 31, 32, 50]. Six articles were excluded because a 2 × 2 contingency table could not be constructed [28, 39, 43, 45, 53, 54], five evaluated patients after recent treatment with chemotherapy or evaluated response after treatment and were therefore excluded [17, 27, 37, 42, 49] and three were a meta-analysis and/or review [47, 48, 52]. One was a case report [51]. One article evaluated the accuracy of a combination of diagnostic modalities to detect lesions without specifying individual accuracies per modality and was thus excluded [46]. Finally, one study was excluded because it only evaluated the clinical reports for PET and did not re-evaluate the images for the study [67]. Study identification and inclusion are shown in a flowchart in Fig. 1. Results of the quality assessment with the QUADAS checklist are shown in Table 1.

Fig. 1 PRISMA flowchart describing the identification and inclusion of studies

Table 1 QUADAS checklist for all included studies

Individual study characteristics are presented in Table 2. Of the 14 articles included, 3 studied a single modality [2, 8, 62] and 11 compared two or three different modalities [6, 7, 57–61, 63–66]. Grouping the articles according to investigated imaging modality, 12 articles studied the performance of PET [6–8, 57–62, 64–66], 7 articles studied the performance of CT [6, 8, 57, 59, 61, 64, 66], 5 articles studied the performance of PET/CT [2, 7, 58, 61, 63] and 1 article studied the performance of MRI [63]. In two studies which evaluated both PET and CT only part of the thorax and abdomen was imaged with CT and therefore only the results for PET were included from these studies [59, 66]. So, in total results from five studies were available for CT. All CT studies used intravenous contrast. The number of patients ranged from 24 to 115 patients per study, with a total of 861 patients evaluated in all included studies. The percentage of male patients varied from 46 to 71% and the mean or median age ranged from 58 to 68 years. All studies used histopathology or a combination of histopathology, clinical and radiological follow-up, conventional diagnostic modalities (X-ray, endoscopy, ultrasound) and surgical exploration as reference method. The indication for whole-body imaging was suspected local or distant recurrence based on clinical symptoms, rise in CEA levels, endoscopy findings or findings from other imaging methods in all studies for all or the majority of patients.

Table 2 Characteristics of the included articles

Estimates of diagnostic performance, such as sensitivity, specificity, positive and negative predictive value, accuracy, DOR and AUC for all individual studies are shown in Table 3.
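The per-study measures listed above all derive from the same 2 × 2 contingency table. The sketch below shows the standard definitions; the function name and the example counts are hypothetical and do not correspond to any of the included studies.

```python
def diagnostic_measures(tp, fp, fn, tn):
    """Standard accuracy measures from one 2 x 2 table (all cells > 0)."""
    total = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn),   # diseased correctly detected
        "specificity": tn / (tn + fp),   # non-diseased correctly excluded
        "ppv": tp / (tp + fp),           # positive predictive value
        "npv": tn / (tn + fn),           # negative predictive value
        "accuracy": (tp + tn) / total,
        "prevalence": (tp + fn) / total,
    }

# Hypothetical counts for a single study.
for name, value in diagnostic_measures(tp=40, fp=10, fn=10, tn=40).items():
    print(f"{name}: {value:.2f}")
```

Unlike sensitivity and specificity, the predictive values depend on the prevalence of disease in the studied population, which is one reason prevalence was extracted for each study.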

Table 3 Diagnostic performance for all the included studies, sorted by modality

Summary receiver-operating characteristic curves

SROC curves for the diagnostic performance of PET, PET/CT and CT and the individual study results are shown in Fig. 2. PET and PET/CT had the best diagnostic performance for recurrence detection, with DORs of 55.2 [95% confidence interval (CI) 23.2–131.2] and 55.3 (95% CI 15.9–191.8), respectively, compared to a DOR of 9.8 (95% CI 4.2–22.8) for CT. The single study concerning MRI was not included in the regression analysis but the results are shown in the graph as a single value. The DOR for this MRI study was 35.1 (95% CI 13.5–90.4). The corresponding AUCs for PET, PET/CT and CT for recurrence detection were 0.94 (95% CI 0.90–0.97), 0.94 (95% CI 0.87–0.98) and 0.83 (95% CI 0.72–0.90), respectively. CT had a significantly lower diagnostic performance than PET (p = 0.021). Between CT and PET/CT the difference was not significant (p = 0.10). The difference between PET and PET/CT was not significant either (p = 0.66). The AUC for the single MRI study was 0.92 (95% CI 0.86–0.96). Pooled sensitivity and specificity, AUC and DOR for the modalities are shown in Fig. 3.

Fig. 2 SROC curves with all individual study results for all modalities. The single study on MRI is displayed as a single value in the graph. PET (n = 12), CT (n = 5), PET/CT (n = 5), MRI (n = 1)

Fig. 3 Pooled sensitivity (%) and specificity (%), area under the ROC curve (AUC, %) and diagnostic odds ratio (DOR) for CT, PET and PET/CT with 95% confidence intervals indicated by error bars

Subgroup analyses

The Cochran Q test showed that there is significant heterogeneity between study results for each imaging modality (p < 0.10). This heterogeneity is also illustrated in Fig. 2, which shows substantial scatter of observed pairs of sensitivity and specificity of individual studies around the fitted SROC curves. To correct for potential sources of heterogeneity, subgroup analyses were performed. Pooled estimates of diagnostic performance were calculated within subsets of studies that differed with respect to factors that potentially can affect diagnostic performance: (1) unit of analysis (patient based versus lesion based), (2) prevalence (percentage of patients with malignant disease in the studied population) as an indicator of disease spectrum (<75% versus ≥75%), (3) blinding to clinical information (yes versus no), (4) design (retrospective versus prospective) and (5) year of publication (<2003 versus ≥2003). Full blinding was defined as blinding to both clinical information and other imaging results. Partial blinding was defined as blinding to other imaging results only. In all subgroups CT remained the modality with the lowest diagnostic performance. The results of the subgroup analyses are displayed in Fig. 4. PET had a significantly lower diagnostic performance in studies published in or after 2003: AUC was 0.96 (95% CI 0.92–0.98) before 2003 vs 0.87 (95% CI 0.78–0.92) from 2003 onwards, p = 0.013.

Fig. 4 Areas under the SROC curve with 95% confidence intervals (error bars) per modality for subgroups. Prevalence refers to the prevalence of disease in the studied population. Fully blinded is defined as reading the images without any knowledge about the patient. Clinical info indicates that readers were aware of clinical information about the patients, but had no knowledge about results from other imaging studies. In some subgroups columns are missing for one or more modalities, because no or only one study was available for that subgroup and thus the subgroup analysis could not be performed

Discussion

In this meta-analysis we compared PET, PET/CT, CT and MRI for whole-body staging in patients who have suspected recurrence in the follow-up for curatively treated colorectal cancer. We found that PET and PET/CT have a high diagnostic performance with an AUC of 0.94 for both PET and PET/CT. CT had a significantly lower diagnostic performance than PET or PET/CT with an AUC of 0.83. This lower diagnostic performance persisted after correction for differences in design and analysis of studies. The subgroup analyses showed that in studies in which readers were fully blinded (to both clinical information and other imaging results) or in studies which were published after 2003 the diagnostic performance was lower for PET. The single study evaluating MRI showed a high AUC of 0.92.

PET and PET/CT were the most accurate modalities. PET and PET/CT are metabolic imaging techniques that provide information on the nature of a lesion based on differences in glucose metabolism. Malignant lesions have a higher glucose metabolism and thus a higher uptake of FDG. These changes in metabolism are known to precede changes in morphology (which are evaluated with CT), hence the higher sensitivity of PET or PET/CT compared with CT in the detection of small malignant lesions. Fig. 5 illustrates a lesion with high FDG uptake (and thus detected with PET) which could not (yet) be identified with CT. FDG uptake is also increased in inflammatory tissue and in normal organs such as the brain, the urinary tract and bowel, causing false-positive findings. By combining the functional information of PET with the morphological information of CT, false-positive errors can be reduced and superior performance of PET/CT over PET and CT as stand-alone techniques is expected. Nevertheless, the results of our meta-analysis do not confirm superior performance for PET/CT over PET. There are some methodological issues related to this finding. Only three studies compared PET/CT with PET within the same patient group and therefore most of the data originated from studies without direct comparison of both modalities in the same patients [7, 58, 61]. Differences in study designs between the PET and PET/CT studies could have influenced the results and this may have favoured the performance of PET. The results of the three studies that did compare PET with PET/CT in the same study population showed a superior performance for PET/CT, especially in the patient based analyses [7, 58, 61]. In our subgroup analyses of patient based study results we could confirm the higher performance of PET/CT over PET on a patient basis (AUC 0.95 for PET/CT vs 0.92 for PET, Fig. 4). The three studies also found that readers were more confident in their diagnosis of lesions with PET/CT than with PET only. Kim et al. reported the highest confidence level scores for PET/CT (91%) compared to 61% for PET and 50% for CT [7].

Fig. 5 Diagnostic CT image (left) and PET image (right) of a patient who has a clearly visualised para-aortic lesion on PET (arrow), which cannot be discerned on CT

CT had a lower diagnostic performance than PET/CT and PET. The cause may be that the accuracy of CT for extrahepatic metastasis detection is lower than that of PET and PET/CT. CT is known to be more accurate in the detection of hepatic than in the detection of extrahepatic metastases (including local recurrence), making it less ideal for whole-body staging. Studies in this meta-analysis have shown that with respect to the detection of extrahepatic lesions CT performs worse (sensitivities 53–71% and specificities 50–85%) than PET (sensitivities 70–100% and specificities 40–100%) [59, 61, 66]. Several older studies have reported a low diagnostic performance for the detection of local recurrence with CT [68, 69]. A more recent study by Stückle et al. with multislice CT acknowledged the low sensitivity (38–82%) for local recurrence detection but reported high specificity (97–100%) in the follow-up after surgery [70]. Because the studied populations in our meta-analysis comprised patients who had both local and distant recurrence, the diagnostic performance of CT can be influenced by the fact that local recurrence detection is difficult with CT (Fig. 6).

Fig. 6 CT image (left) and PET image (right) of a patient with locally recurrent colorectal cancer after a sigmoid resection. On PET a clear hot spot (arrow) is found with increased FDG uptake, while on CT it was not recognised as a local recurrence (arrow)

Finally, most included studies evaluating whole-body CT were performed around the year 2000 and it should be taken into account that since then the quality of CT may have improved considerably. This may have caused underestimation of the diagnostic performance of CT in this meta-analysis. However, a more recent study (2007) comparing a modern CT technique with PET and PET/CT within the same patient population by Nakamoto et al. still showed that CT had the lowest diagnostic performance, while PET/CT was the most accurate modality.

An interesting finding in this meta-analysis is that in more recently published studies the diagnostic performance of PET was significantly worse than in earlier publications. This phenomenon is observed more often in diagnostic studies and may be explained by publication bias. In the late 1990s PET was a relatively new modality, so the chance for acceptance for publication was higher for positive study results. Another possible explanation is that study design and methodology have improved over time leading to more critical evaluation of the modality and thus possibly lower diagnostic performance of a modality.

Awareness of clinical information clearly improved performance. The diagnostic performance of studies in which readers were fully blinded was lower than in studies in which readers were aware of all clinical information except for results of other imaging modalities. The largest difference was observed for PET (AUC 0.91 for full blinding vs 0.98 in case of awareness of clinical data). This finding is in agreement with clinical experience that knowledge of the clinical information of the patient is considered crucial to achieve sufficient diagnostic performance, particularly for PET. These findings underline the necessity for clinicians to provide radiologists and nuclear physicians with full information about the patient’s clinical status and the importance for radiologists and nuclear physicians to be involved in multidisciplinary management teams, where they are confronted with the clinical situation.

One established modality that has been used increasingly in the last decade, in particular for the follow-up of patients with suspected local recurrence after surgery for rectal cancer, is MRI [71]. Although MRI has been shown to be feasible for the detection of local recurrences, its yield is not high enough to warrant routine use in the follow-up of rectal cancer patients [72, 73]. The single study that evaluated whole-body MRI showed good results, but more evidence is needed to establish the role of MRI in whole-body imaging for colorectal cancer.

Limitations

There are some limitations to this study. The first important issue is that this is a meta-analysis of published studies and therefore heterogeneity between studies is present. To account for this heterogeneity we performed subgroup analyses according to factors that were likely to cause heterogeneity and still found that CT had a lower diagnostic performance than PET and PET/CT. However, because of the relatively small number of studies per modality, simultaneous correction for more than one study factor was not feasible, making our level of evidence less robust. Moreover, residual heterogeneity may have remained unexplained due to some unmeasured or unreported study characteristics, which is inherent to a meta-analysis based on published data.

Second, in all studies a combination of pathology and follow-up was used as the reference standard. However, undetected lesions will not be discovered until they become visible with imaging and therefore there is a chance for verification bias. Most studies that used follow-up as the reference standard had a follow-up time of at least 6 months. However, small missed lesions might become visible after a longer interval. Verification bias could then lead to overestimations of accuracy.

Third, because we aimed to evaluate whole-body staging for the detection of both local and distant recurrences of colorectal cancer in patients with suspected recurrence based on clinical findings or rise in CEA, we excluded studies which merely provided data on liver recurrence or local recurrence only.

Last, most PET/CT studies in our meta-analysis used side-by-side comparison of single PET and single CT, and fused PET/CT was only scarcely used.

Conclusions and clinical relevance

Our study suggests that for whole-body imaging of patients with a (high) suspicion of recurrent colorectal cancer during follow-up, PET/CT is the most accurate imaging modality, closely followed by PET, which performs slightly worse than PET/CT on a patient basis. CT has the lowest diagnostic performance.

This meta-analysis explored diagnostic performance in the clinical setting in which patients had suspected local and/or distant recurrence based on clinical findings or a rise in CEA. In current clinical practice CT is the most widely used modality for these patients, and PET or PET/CT is performed only when CT findings are equivocal. Our meta-analysis suggests that, instead of CT as the first-line imaging modality, PET/CT might be the recommended modality for patients with suspected local or distant recurrence based on clinical findings or rise in CEA. In such patients, a negative CT result does not seem to help in excluding a recurrence and should be followed by PET/CT in any case. Furthermore, when CT findings are equivocal, PET/CT is needed to further characterise lesions, and when CT detects malignant lesions, PET/CT is obligatory to search for additional metastases when curative surgery of the malignant lesions is considered. However, when interpreting these results one should keep in mind the limitations of this meta-analysis with regard to heterogeneity and number of studies, which make the estimate of diagnostic performance less precise and less definitive. Furthermore, whether implementation of this recommended diagnostic strategy is feasible in clinical practice will also depend on the cost-effectiveness of this approach.