Introduction

Cross-sectional imaging techniques are widely used for diagnosis and evaluation of Crohn’s disease. Numerous studies have evaluated the diagnostic accuracy of cross-sectional imaging techniques in patients with Crohn’s disease, and a meta-analysis was published that investigated the diagnostic accuracy of computed tomography (CT), magnetic resonance imaging (MRI), ultrasound (US) and scintigraphy [1]. However, clinical monitoring and choice of therapy largely rely on grading of disease activity.

Clinical symptoms and inflammatory lesions can exist independently, so assessment of the bowel is essential in guiding therapy decisions [2]. If inflammation is present, it is important to distinguish between mild, moderate and severe disease, as medical management differs among these stages [3]. Ileocolonoscopy, the current reference standard for luminal Crohn’s disease, is accurate for assessing mucosal abnormalities, but it has several drawbacks, as it is an invasive technique, is associated with the risk of bowel perforation, is incapable of assessing trans- and extraluminal disease, and is limited to the colon and terminal ileum [4]. Video capsule endoscopy (VCE) is a well-tolerated and accurate alternative to ileocolonoscopy that allows assessment of the whole gastrointestinal tract, although it has shown lower specificity and bears the risk of capsule retention, which occurs in up to 13 % of patients with Crohn’s disease [5].

Cross-sectional imaging techniques that could accurately grade disease severity would be preferable to ileocolonoscopy, as they are non-invasive and not limited to the colon and terminal ileum. Several studies have looked at the use of cross-sectional imaging for assessing the severity of Crohn’s disease, but offered no comparison between imaging techniques, as no meta-analysis was performed [2, 6]. To our knowledge, only one such meta-analysis has been performed, but it evaluated only MRI and used a search period that ended in April 2007 [7]. This study showed that MRI correctly graded disease activity in 91 % of patients with frank (moderate-to-severe) disease. However, correct grading was limited in patients with disease in remission and with mild disease (62 % for both). Furthermore, no comparison with other imaging techniques was made and numerous articles on the grading of Crohn’s disease using MRI have been published since 2007.

Our purpose was to systematically review and compare the accuracy of CT, MRI, US, scintigraphy and positron emission tomography–computed tomography (PET-CT) in grading Crohn’s disease activity on a per-patient or per-segment basis as compared to endoscopy, biopsies or intraoperative findings by performing a meta-analysis. Furthermore, we aimed to investigate the degree of over- and under-grading for these imaging techniques.

Material and methods

This review was conducted according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [8]. The review protocol was not published or registered in advance.

Literature search and strategy

We performed an electronic search in MEDLINE, EMBASE and Cochrane databases for studies examining the accuracy of CT, MRI, US, scintigraphy and PET (-CT) for grading Crohn’s disease activity in human subjects. Search terms ‘Crohn’s disease’ and ‘inflammatory bowel disease’ were combined using ‘OR’ and search terms for imaging modalities were combined using ‘OR’ as well. These two groups were combined using ‘AND’. The search period was limited from January 1983 to March 2014. Details of the search strategy are provided in the electronic supplementary material (Appendix E1).

Study selection on title and abstract

All articles retrieved from the electronic search were assessed by one observer (CP). Non-relevant articles and articles in the form of a review, case report, comment or letter were excluded. Subsequently, the remaining titles and abstracts were independently assessed by two observers (CP, JT) to identify potentially eligible articles. In cases of uncertainty, articles were deemed potentially eligible and retrieved as full text.

Study selection on full text

The full texts of the remaining articles were retrieved. Two observers (CP, JT) independently reviewed all eligible articles for the following inclusion criteria: (a) ten or more patients were included (fewer were considered case-series); (b) CT, MRI, US, scintigraphy or PET (-CT) was used to grade Crohn’s disease activity; (c) patients with clinically suspected inflammatory bowel disease (IBD) or known IBD/Crohn’s disease were included; (d) endoscopy, biopsies or intraoperative findings were used as a reference test; (e) imaging features used for grading disease activity were defined; (f) raw data were available to construct 3 × 3 tables; (g) articles were written in English, Italian, Spanish, French, German or Dutch; and (h) patients with Crohn’s disease could be analysed separately from other IBD patients. No patient age limits were applied. Articles in the form of a review, case report, conference abstract, comment or letter were excluded. In the case of duplicate publications, we excluded the studies with the lower number of patients. Disagreement regarding potential eligibility and inclusion was resolved by consensus. The observers were not blinded to author and journal names.

Study characteristics

Methodological characteristics

Both reviewers extracted study characteristics independently for all included articles using a standardized form. To assess the quality of the study design, we used a modified Quality Assessment of Diagnostic Accuracy Studies 2 (QUADAS 2) tool [9, 10], as it separately assesses risk of bias in several methodological domains (patient selection, index test, reference test and patient flow) using a number of signalling questions (Table 1). Risk of bias for each domain was described as high, low or unclear. In addition, concerns regarding the applicability of the patient population, index and reference test to the review question were rated by the observers as high, low or unclear. Disagreements were resolved by discussion.

Table 1 Methodological characteristics from the QUADAS tool and their corresponding signalling questions [9, 10]. The risk of bias is determined for every domain using the signalling questions

Patient characteristics

The following patient characteristics were recorded: number of patients included, number of patients in the analysis, whether patients were recruited consecutively, age characteristics, gender ratio, patient spectrum (i.e. known or suspected IBD or Crohn’s disease) and other selection criteria for patient inclusion.

Imaging characteristics

Imaging characteristics concerning type of equipment and basic specifications (type of scanner for CT, field strength and coil type for MRI, and transducer type for US), techniques used for evaluation (sequences for MRI, use of Doppler for US, labelling target and tracer type for scintigraphy), bowel preparation (fasting and/or laxatives), use of luminal and/or intravenous contrast medium, timing of post-contrast scans and use of spasmolytic drugs were extracted.

Reference test

All reference tests (i.e. endoscopy, biopsies or intraoperative findings) used for analysis were recorded.

Imaging and reference test interpretation

We recorded the following information regarding interpretation of imaging and reference tests: the interval in days between index and reference tests, bowel segments that were examined, grading criteria used for imaging and reference tests, imaging features used for evaluation of disease activity, and whether grading was performed on a per-patient and/or per-bowel-segment basis.

Data extraction

Grading results for imaging and reference tests were extracted with the grading scales used in individual studies (i.e. three-, four-, or five-grade scales). From these data, three-by-three contingency tables comparing results from index and reference tests categorized as none, mild or frank disease were constructed for each study. These categories did not use predefined criteria, but were formed either by using the original grading from each study (in the case of a three-grade scale) or by merging certain grades to form a three-grade scale. If a four-grade scale was used (none, mild, moderate or severe disease), groups with moderate and severe disease were merged into frank disease. For five-grade scales, the second and third grades were grouped into mild disease and the fourth and fifth were grouped into frank disease. When studies used multiple reference tests, we used intraoperative findings as the reference standard. In other cases, histological findings from biopsies were preferred over endoscopic findings. Because the imaging results in these studies were based on the most severe lesion, we considered histological data from biopsies as more lesion-specific and better resembling imaging results than endoscopic results.
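As an illustration only, the merging rules described above could be sketched as follows. The function names, the 0-based grade coding and the assumption that index and reference tests share the same scale within a study are ours and are not taken from the original analysis.

```python
# Illustrative sketch: collapse 3-, 4- or 5-grade scales into none/mild/frank
# and tabulate a 3 x 3 contingency table (rows: reference test, columns: index test).
# Assumption: grades are coded 0 .. n_grades-1, with 0 meaning no disease.

CATEGORIES = ("none", "mild", "frank")

# Highest 0-based grade that still counts as "mild" for each scale size.
MILD_MAX = {3: 1, 4: 1, 5: 2}


def to_three_grade(grade: int, n_grades: int) -> str:
    """Map a grade on a 3-, 4- or 5-grade scale to none, mild or frank."""
    if n_grades not in MILD_MAX:
        raise ValueError("only 3-, 4- and 5-grade scales are handled")
    if not 0 <= grade < n_grades:
        raise ValueError("grade outside the scale")
    if grade == 0:
        return "none"
    return "mild" if grade <= MILD_MAX[n_grades] else "frank"


def build_table(pairs, n_grades: int):
    """Build a 3 x 3 table from (index_grade, reference_grade) pairs."""
    table = [[0, 0, 0] for _ in range(3)]
    for index_grade, reference_grade in pairs:
        row = CATEGORIES.index(to_three_grade(reference_grade, n_grades))
        col = CATEGORIES.index(to_three_grade(index_grade, n_grades))
        table[row][col] += 1
    return table
```

For a four-grade scale, for example, `to_three_grade(2, 4)` and `to_three_grade(3, 4)` both fall into the frank category, mirroring the merging of moderate and severe disease.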

Publication bias

To study publication bias, we followed the method by Deeks et al., as recommended in the Cochrane handbook for DTA reviews [11]. We first calculated effective sample sizes (ESS) for each study. We then performed linear regression analyses if enough datasets were available in a group (n > 5), with the proportion of accurate grading per study as the independent variable and 1/√ESS as the dependent variable. A significant regression coefficient (P < 0.05) was deemed sufficient to indicate publication bias.
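A minimal sketch of the quantities involved is given below. It assumes the conventional two-group definition of the effective sample size, ESS = 4·n1·n2/(n1 + n2); how the two groups were defined for three-category grading data is not stated here, so `n1` and `n2` are hypothetical inputs. The sketch returns only the least-squares slope and intercept and does not compute the P value.

```python
import math


def effective_sample_size(n1: float, n2: float) -> float:
    """Two-group effective sample size (assumed definition, see lead-in)."""
    if n1 <= 0 or n2 <= 0:
        raise ValueError("group sizes must be positive")
    return 4.0 * n1 * n2 / (n1 + n2)


def precision_term(ess: float) -> float:
    """The 1/sqrt(ESS) term used in the regression."""
    return 1.0 / math.sqrt(ess)


def ols_fit(x, y):
    """Ordinary least-squares slope and intercept of y regressed on x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x has no variance")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

The caller decides which variable is passed as `x` and which as `y`; a significance test of the slope would additionally require the t distribution, which is outside the scope of this sketch.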

Data analysis

For each study, we constructed three proportions: ‘accurate grading’, defined as the number of correctly graded patients or segments; ‘under-grading’, defined as the number of patients or segments on which the index test graded lower than the reference test; and ‘over-grading’, defined as the number of patients or segments on which the index test graded disease activity higher than the reference test. Datasets were sorted into groups by type of imaging, which were then subdivided by target of evaluation (per-patient or per-segment). To quantify heterogeneity we calculated the I²-statistic for each group. Data were pooled if more than one dataset was available in a group and the data were not too heterogeneous (I² < 75 %) [12].
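The three proportions and the pooling rule can be illustrated with a short sketch. It assumes a 3 × 3 table with reference-test grades in rows and index-test grades in columns (none, mild, frank), and uses the standard definition of I² from Cochran's Q, which is not spelled out in the text; the function names are hypothetical.

```python
def grading_proportions(table):
    """Return (accurate, under, over) proportions from a 3 x 3 table.

    Rows are reference-test grades and columns are index-test grades,
    both ordered none, mild, frank (assumed layout).
    """
    total = sum(sum(row) for row in table)
    if total == 0:
        raise ValueError("empty table")
    cells = [(r, c) for r in range(3) for c in range(3)]
    accurate = sum(table[r][c] for r, c in cells if c == r)
    under = sum(table[r][c] for r, c in cells if c < r)  # index graded lower
    over = sum(table[r][c] for r, c in cells if c > r)   # index graded higher
    return accurate / total, under / total, over / total


def i_squared(q: float, df: int) -> float:
    """I² in percent from Cochran's Q and its degrees of freedom."""
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0


def should_pool(n_datasets: int, i2_percent: float) -> bool:
    """Pooling rule from the text: more than one dataset and I² < 75 %."""
    return n_datasets > 1 and i2_percent < 75.0
```

The three proportions always sum to 1, since every patient or segment is graded either the same as, lower than or higher than the reference test.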

For the pooled data, we calculated mean logit accurate grading and under- and over-grading values with corresponding standard errors using non-linear fixed or random effects models based on the Akaike information criterion (AIC) statistic (a lower AIC value indicates a better fit) [13, 14]. Using anti-logit transformation, we obtained summary estimates with 95 % confidence intervals (95 % CI) for accurate grading and over- and under-grading. In several studies, multiple datasets were available (i.e. multiple readers). Because we used all datasets for analysis, we adjusted the correlation between datasets from the same study by adding the same number for each study in the subject statement of the random effects approach.
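The back-transformation step can be sketched as follows. This does not reproduce the non-linear fixed or random effects models; it only shows how a mean logit and its standard error (both taken here as given inputs) translate into a summary estimate with a 95 % CI, assuming a normal approximation with the 1.96 quantile.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a proportion strictly between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def inv_logit(x: float) -> float:
    """Anti-logit transformation back to a proportion."""
    return 1.0 / (1.0 + math.exp(-x))


def summary_with_ci(mean_logit: float, se: float, z: float = 1.96):
    """Return (estimate, lower, upper) on the proportion scale."""
    lower = inv_logit(mean_logit - z * se)
    upper = inv_logit(mean_logit + z * se)
    return inv_logit(mean_logit), lower, upper
```

Because the interval is computed on the logit scale, the back-transformed limits always stay between 0 and 1 and are generally asymmetric around the estimate.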

Comparison of CT, MRI, US and scintigraphy was performed with Z-tests using the logit values of the pooled data. For data that were not pooled, we performed logit transformation using proportion and sample size (n) to enable comparison. To calculate logit values for proportions of 0 % or 100 %, we added 0.5 to the number of events [15]. P values less than 0.05 indicated a statistically significant difference. All data analyses were performed using Excel 2010 (Microsoft Corporation, Redmond, WA, USA), SPSS 22.0 (IBM SPSS Statistics for Macintosh, Version 22.0; IBM Corp., Armonk, NY, USA), and SAS 9.3 (SAS Institute, Cary, NC, USA) software programs.
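A rough sketch of such a comparison is shown below. The standard error formula for an unpooled logit, and the choice to add 0.5 to both events and non-events when either is zero, are common conventions that we assume here; the text only states that 0.5 was added to the number of events. Function names are hypothetical.

```python
import math


def logit_from_counts(events: float, n: float):
    """Logit and its standard error from event count and sample size.

    Assumption: when events or non-events are zero, 0.5 is added to both
    so that the logit and its standard error remain finite.
    """
    non_events = n - events
    if events < 0 or non_events < 0:
        raise ValueError("events must lie between 0 and n")
    if events == 0 or non_events == 0:
        events += 0.5
        non_events += 0.5
    estimate = math.log(events / non_events)
    se = math.sqrt(1.0 / events + 1.0 / non_events)
    return estimate, se


def z_test(logit1: float, se1: float, logit2: float, se2: float):
    """Two-sided Z-test for the difference between two logit estimates."""
    z = (logit1 - logit2) / math.sqrt(se1 ** 2 + se2 ** 2)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p_value
```

A pooled estimate would enter `z_test` with the mean logit and standard error from the model, whereas an unpooled dataset would first pass through `logit_from_counts`.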

Results

Search and study selection

The search yielded 9356 articles. After selection on title and/or abstract, 149 articles remained and were retrieved as full-text articles (Fig. 1). Of these remaining articles, 130 did not fulfil the eligibility criteria (Appendix E2). Nineteen articles met all inclusion criteria and were included for further data extraction. CT was evaluated in 3 [16–18], MRI in 11 [19–29], US in 3 [30–32], and scintigraphy in 3 [18, 33, 34]. No articles evaluating PET-CT were found that met our criteria.

Fig. 1 Flow diagram showing study selection

Study characteristics

Methodological characteristics

Evaluation of the imaging tests was performed blinded to the reference test in 13 studies [17, 18, 21, 22, 24–30, 33, 34]. The reference test was performed blinded to the imaging results in 12 studies [16, 17, 19, 21, 24, 26–30, 33, 34]. The remaining studies did not specify whether observers were blinded to other results [20, 23, 31, 32]. Fifteen of the studies included patients prospectively [16–26, 28, 30, 31, 34]. Signalling questions for the QUADAS tool were answered with ‘yes’ in 78.9 % of cases (Fig. 2). Patient selection and index test domains showed less risk of bias than reference test and patient flow domains. Concern about applicability of patient selection and index and reference tests was generally low (Fig. 3).

Fig. 2 QUADAS signalling questions (Table 1) per domain (from top to bottom: patient selection, index test, reference test and patient flow). The last column shows whether studies included patients prospectively.

Fig. 3 QUADAS risk of bias per domain and concerns regarding applicability for domains of patient selection, index test and reference test

Patient characteristics

A total of 549 patients were included (75 for CT, 347 for MRI, 86 for US, and 58 for scintigraphy). The mean study size was 29 patients (range, 10–76). Study characteristics are presented in Table 2. In ten of the studies, patients were recruited consecutively [17, 19, 20, 22–26, 28, 31]. Studies included patients with clinically suspected IBD, known IBD/Crohn’s disease, or a combination of both (12, 4, and 3 studies, respectively).

Table 2 Study characteristics

Imaging characteristics

Imaging equipment and specifications are presented in Tables 3, 4, 5 and 6. Bowel preparation (fasting and/or laxatives) was used in eight studies (1 CT, 7 MRI) [17, 21–26, 28]. Luminal contrast medium was used in ten studies (3 CT, 7 MRI) [16–18, 21–23, 25, 27–29], of which one used enteroclysis [27]. Intravenous contrast medium was used in 13 studies (2 CT, 11 MRI) [16, 17, 19–29].

Table 3 CT characteristics
Table 4 MRI characteristics
Table 5 US characteristics
Table 6 Scintigraphy characteristics

Reference test

Endoscopy, biopsies and intraoperative findings were used in 11, 8 and 4 studies, respectively (Table 7). Three studies recorded results for both endoscopy and histology from biopsies, for which we used the histological data in our analysis [30, 33, 34].

Table 7 Imaging and reference test interpretation

Imaging and reference test interpretation

Thirteen of the studies used an interval of less than one month between imaging and reference test [17, 19–23, 26, 28, 29, 31–34]. The imaging features most commonly used for evaluation were bowel wall thickness and post-contrast enhancement (or tracer uptake for scintigraphy), which were both used in 17 studies (Table 7). The reference test and imaging criteria for each study are presented in Tables 8 and 9.

Table 10 Comparison table with results for imaging tests from the 3 × 3 data analysis and corresponding P values
Table 8 Original reference test criteria and categorization for this study

Publication bias

Linear regression analysis on MRI per-patient data showed a regression coefficient of 0.4 (95 % CI: −0.9 to 0.9), with no significant relationship between accurate grading and 1/√ESS (P = 0.09). Data in other groups were deemed insufficient for performing linear regression analyses.

Data analysis

Results from our data analysis are presented in Table 10. Three-by-three contingency tables for each study can be found in the supplementary materials (Appendix E3).

Table 9 Original imaging criteria and categorization for this study

Per-patient

Data were provided on a per-patient basis in 13 studies (evaluating CT in 2, MRI in 9, US in 1 and scintigraphy in 1) (Fig. 4). I² values for overall grading accuracy for groups with more than one dataset were as follows: 67.7 % (95 % CI: 42.6–81.8 %) for CT, and 73.9 % (95 % CI: 56.2–84.4 %) for MRI.

Fig. 4 Accurate grading, over- and under-grading per study on a per-patient and per-segment basis

CT and MRI data were pooled for each modality (I² < 75 %). US and scintigraphy were not pooled, as only one dataset was available for each modality. CT, MRI, US and scintigraphy showed accurate grading estimates of 86 % (95 % CI: 75–93 %), 84 % (95 % CI: 67–93 %), 44 % (95 % CI: 28–61 %) and 40 % (95 % CI: 16–70 %), respectively. CT and MRI showed similar overall grading accuracy (P = 0.8), both higher than US (P = 0.0001 and P = 0.001, respectively) and scintigraphy (P = 0.003 and P = 0.01, respectively). CT and MRI showed similar over-grading (P = 0.8) and under-grading (P = 0.5). Both showed less under-grading than US (P = 0.002 and P = 0.003, respectively) and scintigraphy (P = 0.0005 and P = 0.001, respectively).

Per-segment

Data were provided on a per-segment basis in seven articles, of which one evaluated both CT and scintigraphy, two evaluated MRI, two evaluated US, and two evaluated scintigraphy (Fig. 4). I² values were 86.3 % (95 % CI: 66.4–94.4 %) for MRI, 91.5 % (95 % CI: 79.1–96.6 %) for US, and 0 % for scintigraphy. MRI and US data were not pooled, as data were too heterogeneous (I² ≥ 75 %). Data on CT were also not pooled, as only one dataset was available. The overall grading accuracy was 87 % (95 % CI: 77–93 %) for CT and 86 % (95 % CI: 80–91 %) for scintigraphy. CT and scintigraphy showed similar overall grading accuracy (P = 0.8), over-grading (P = 0.2) and under-grading (P = 0.5). Accuracy for MRI and US ranged from 67 to 82 % and 56 to 75 %, respectively.

Discussion

In this study, we have shown that MRI and CT are highly accurate for grading Crohn’s disease activity. These findings are important, as cross-sectional imaging plays an increasing role in the assessment of Crohn’s disease activity, and there has been ongoing debate regarding the modality that should be the preferred choice [35–37]. Several studies have compared two or more modalities in the same patient group [38–41], but they have had relatively small sample sizes or only evaluated the terminal ileum.

CT and MRI showed similar accuracy in grading Crohn’s disease activity (86 % and 84 % on a per-patient basis, respectively), and no significant differences in accuracy were seen between these two modalities. Data on over- and under-grading showed similar results for CT and MRI, further strengthening our conclusion of their comparability. Scintigraphy showed high accuracy of 86 % and 86 % for the studies using per-segment data, while accuracy of 40 % was reported in per-patient data. However, per-patient data for scintigraphy were reported in only one study, and with a small sample size (n = 10) [34]. Furthermore, scintigraphy had the fewest included patients (n = 58) in our meta-analysis. US showed low accuracy of 44 % in the per-patient data and 75 % and 56 % for studies in the per-segment data. However, a relatively small number of patients (n = 86) were included. In addition, no eligible studies evaluated luminal or intravenous contrast medium for US. The use of intravenous contrast appears to be a particularly promising technique, and may increase the accuracy of US. However, no robust reference standard or appropriate grading scale was used in the studies that evaluated it. We considered the possibility of performing subgroup and covariate analyses on the differences in technique, imaging criteria, reference methods and methodological criteria, but the results of these analyses would not be meaningful given the limited amount of available data. We examined the MRI features used in the three studies with the highest accuracy values. The following MRI features were used by at least two of these studies: bowel wall thickness, T1 enhancement and pattern, T2 mural signal intensity, mucosal abnormalities, presence of inflammatory mass, stenosis (with pre-stenotic dilatation), lymph nodes, abscesses, and fistulas [25, 27, 29].

The observed heterogeneity of the grading criteria for the index and reference tests in the studies that we included, our adjustment to construct 3 × 3 tables, and the differences in available data between imaging modalities were the major limitations of this meta-analysis. Although the grading criteria for index and reference tests differed by study, and different imaging features were used, the studies included showed considerable overlap in the use of imaging features and grading criteria. No generally accepted scoring systems exist for imaging of Crohn’s disease. To construct 3 × 3 tables from original 4 × 4 data, we merged moderate and severe disease into one group. Our decision to merge these grades was based on five articles [22, 23, 25, 28, 30] that had originally used 3 × 3 tables; two of these studies explicitly stated that their highest grade represented moderate and severe disease combined [25, 28]. The remaining three studies [22, 23, 30] used similar grading criteria. Another limitation was the heterogeneity of grading results, which we examined using I² statistics. Following those results, some of the datasets could not be pooled. In our conclusions, we took into account the greater availability of data for MRI compared to CT, US and scintigraphy. Furthermore, US and scintigraphy studies showed varying results, hampering our ability to arrive at a firm conclusion. There was only one head-to-head comparison study, which compared CT and scintigraphy in 17 patients [18].

We selected three reference standards for this meta-analysis [35]. Intraoperative findings served as the gold standard for assessing Crohn’s disease. We also included endoscopy and endoscopic biopsies as reference standards, although they are not ideal, as they are incapable of assessing proximal ileum, jejunum and extraluminal disease, which could have led to incorrect estimation of disease activity. On the other hand, surgery is performed only in select patients, whereas endoscopy is applied across a wider spectrum. For our analysis, we gave precedence to results from biopsies over endoscopic results, but we recognize that this was a controversial choice, as there is no widespread consensus on which is the better reference standard. The number of studies included could have been increased if VCE and/or double-balloon enteroscopy (DBE) were also used as a reference standard. We chose not to include these studies because interpretation of VCE and DBE has not yet been standardized, and so this would further increase heterogeneity in our study. A growing number of studies are using correlative statistics to examine quantitative scoring systems [42]. Because we used an ordinal outcome measure, we could not include these studies. Nevertheless, a meta-analysis focused on this type of data would be very useful. Finally, only patients with suspected IBD or known Crohn’s disease were included, possibly introducing observer bias, leading to over-grading of disease activity.

Assessment of study quality using the QUADAS tool showed overall moderate quality of the studies included in this meta-analysis. The domains of reference test and patient flow showed the highest risk of bias, while patient selection and index test domains showed the lowest. Concern about the applicability of patient selection and index and reference tests was generally low.

Recently, Vermeire et al. stated that MR enterography had become the reference standard for assessing small and large bowel disease activity [43]. Based on our results, we can agree with this statement. Considering the radiation exposure from CT, it is not appropriate for repeated examinations, even with present-day reduced ionizing radiation exposure per examination, although it still has an important role in the acute setting [44]. Compared to endoscopy, MRI is non-invasive and able to investigate trans- and extramural disease, making it possible to evaluate both the small bowel and colon in one examination. Steps are being taken to come to a more uniform evaluation of MRI in Crohn’s disease, which may improve accuracy [42, 45]. Furthermore, the versatility of MRI may be advantageous with new sequences being studied.

In conclusion, CT and MRI can both be used to grade disease activity in Crohn’s disease, while no conclusions can be made on US and scintigraphy due to the limited and inconsistent data.