With a crude incidence of 134.3 per 100,000, prostate cancer is the most common cancer in men, and the second leading cause of cancer mortality [1, 2]. The quoted incidence has increased in recent years; however, this may be due to the use of prostate-specific antigen (PSA) blood testing. The majority of suspected cases with either a high PSA, abnormal digital rectal examination (DRE), or suggestive symptoms will undergo a transrectal ultrasound (TRUS)-guided biopsy to confirm and grade a histopathologic diagnosis [3]. If this is positive and the patient is a candidate for radical treatment, they will receive multiparametric magnetic resonance imaging (mpMRI) to assess the extent of cancer growth. However, a substantial number of centers now choose pre-biopsy mpMRI followed by a more targeted biopsy.

Multiparametric MRI is a well-established imaging modality for assessing prostate cancer, predominantly to exclude extra-glandular spread and to judge how much of the prostate is involved. It consists of multiple sequences, including T1- and T2-weighted imaging (T2WI), diffusion-weighted imaging (DWI) and, in some instances, dynamic contrast-enhanced (DCE) imaging. Multiple meta-analyses have shown DWI to have good diagnostic accuracy [4,5,6]. Its contrast is governed by numerous technical parameters, one of the most important of which is the diffusion-weighting factor, or ‘b-value’. The b-value reflects the strength and timings of magnetic field gradients applied to the patient, and acquisition of multiple b-values permits calculation of an apparent diffusion coefficient (ADC) map, which gives a quantitative measure of tissue diffusion that has been shown to have an inverse correlation with tumor Gleason score [7]. The current recommendation is to use at least two b-values, one of 50–100 s/mm2 and one of 800–1000 s/mm2, and, if possible, one of 1400–2000 s/mm2 [8, 9]. Theoretically, increasing the maximum b-value results in a better contrast-to-noise ratio (CNR) because there is greater suppression of normal prostate tissue signal, so tumors are more apparent. However, the tradeoff is a reduced signal-to-noise ratio (SNR). Even though b-values > 1400 s/mm2 are recommended, there is little evidence supporting this and there is no widely accepted optimal “high b-value.” In a previous meta-analysis, Wu et al. showed no benefit from increasing b-value, but only one paper in the analysis used b-values of over 1000 s/mm2 [4]. Several recent studies have shown high sensitivity and specificity with higher b-values using both visual and ADC value assessments [10,11,12]. For clinical relevance, we aimed to investigate the diagnostic accuracy achievable by visual assessment of DWI in combination with T2WI at high b-values > 1000 s/mm2.
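As an illustration of how an ADC value follows from signals acquired at two b-values, the sketch below assumes the commonly used mono-exponential decay model, S(b) = S0 · exp(−b · ADC). The function name and the example signal values are hypothetical and are not taken from any included study; scanner software may fit more than two b-values or use other models.

```python
import math


def adc_two_point(s_low, s_high, b_low, b_high):
    """Return the ADC (mm^2/s) from signals measured at two b-values (s/mm^2).

    Assumes mono-exponential decay, S(b) = S0 * exp(-b * ADC), so that
    ADC = ln(S_low / S_high) / (b_high - b_low).
    """
    if s_low <= 0 or s_high <= 0:
        raise ValueError("signal intensities must be positive")
    if b_high <= b_low:
        raise ValueError("b_high must be greater than b_low")
    return math.log(s_low / s_high) / (b_high - b_low)


# Hypothetical voxel signals in arbitrary units, for illustration only.
adc = adc_two_point(s_low=1000.0, s_high=400.0, b_low=50, b_high=1000)
print(f"ADC = {adc:.2e} mm^2/s")
```

Under this model, tissue with restricted diffusion loses less signal between the two b-values and therefore yields a lower ADC.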

Materials and methods

This review was registered with the PROSPERO International prospective register of systematic reviews (reference number: 42016036196) prior to commencement [13]. The review was carried out in accordance with the preferred reporting items for systematic reviews and meta-analysis (PRISMA) guidance [14].

A systematic review of the literature was independently undertaken by two reviewers, who identified studies that investigated the diagnostic accuracy of DWI and T2WI MRI in the detection of prostate cancer. Searches were performed using MEDLINE and EMBASE electronic databases, as well as OpenSIGLE to explore sources of unpublished gray literature. The Science Citation Index was used to identify articles which cite those identified with the original search terms. Once eligible studies were found, their reference lists were manually searched for further potential papers. The search strategy for MEDLINE, including Boolean operators and MeSH terms, is presented in Table 1; the same search strategy was used for each database with alterations to suit. All studies were included up to the date of the search: 1st of September 2017.

Table 1 MEDLINE search terms and strategy

Eligibility

The eligibility criteria for the studies included within the systematic review were that they used both DWI and T2WI MRI in combination for the assessment of prostate cancer; they were applied for the assessment of the pretreatment patient population with a histopathologic reference standard, be that biopsy or radical prostatectomy; they reported sufficient information to produce a 2 × 2 table (true positives, false positives, false negatives, and true negatives) for calculation of sensitivity and specificity; they were published in English; and they assessed more than ten individual patients. To be included, both T2WI and DWI sequences needed to be assessed visually, with both sequences used to assess for tumor presence rather than just for localization. The choice of scoring system, such as Likert or PI-RADS, and whether a sector-based or whole gland assessment was conducted did not affect eligibility. Articles were excluded if they did not satisfy the inclusion criteria above, or if they used a combination of imaging sequences other than DWI and T2WI so that individual data for the desired combination could not be extracted. They were also excluded if an ADC cutoff value was used to discriminate malignant from benign tissue as opposed to visual assessment by certified radiologists. Studies were not excluded by country of origin, age of patients or study design.

Study identification

Initially, papers were screened for relevance by title and then by abstract. The remaining articles had their full text reviewed against the inclusion and exclusion criteria. This was also done independently by the same two reviewers. Any disagreement was resolved by consensus or, if necessary, by a third expert reviewer.

Data extraction

The following data were extracted from each eligible study: year of publication, country of origin, patient group, number of patients, average age and PSA, study design (retrospective or prospective), and the histopathologic reference standard used. Further information on the imaging specifications was also gathered: field strength, coil used, field-of-view, b-value set, and whether they visually assessed DWI source images, ADC maps, or both, for each patient. True positives, false positives, false negatives, and true negatives were also extracted for pooling results. In the case of multireader studies, the results of the most experienced reader were chosen for data extraction. When insufficient data were reported, reviewers calculated the missing values from other reported statistics where possible. All data extraction was independently verified by two reviewers.

Quality assessment

The methodological quality of each included paper was assessed with the Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool, a validated tool specifically designed to critically appraise diagnostic accuracy studies [15]. This was also undertaken independently by two reviewers, with disagreements resolved by discussion and a third expert reviewer consulted if a consensus could not be reached.

Statistical analysis

The sensitivity and specificity with 95% confidence intervals (CIs) were calculated for each included study using the extracted details of the 2 × 2 tables, and forest plots produced.
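The per-study calculation described above can be sketched as follows. The interval method is an assumption: the Wilson score interval is used here for illustration, and the method applied by the review's software is not stated in the text. Function names and the example counts are hypothetical.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 gives ~95% coverage)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1.0 + z ** 2 / total
    centre = (p + z ** 2 / (2.0 * total)) / denom
    half_width = z * math.sqrt(p * (1.0 - p) / total + z ** 2 / (4.0 * total ** 2)) / denom
    return centre - half_width, centre + half_width


def sensitivity_specificity(tp, fp, fn, tn):
    """Return (estimate, lower, upper) for sensitivity and specificity of a 2 x 2 table."""
    sens = tp / (tp + fn)  # proportion of reference-positive cases detected
    spec = tn / (tn + fp)  # proportion of reference-negative cases correctly cleared
    return {
        "sensitivity": (sens, *wilson_interval(tp, tp + fn)),
        "specificity": (spec, *wilson_interval(tn, tn + fp)),
    }


# Hypothetical counts, for illustration only.
result = sensitivity_specificity(tp=70, fp=16, fn=30, tn=84)
for name, (estimate, lower, upper) in result.items():
    print(f"{name}: {estimate:.2f} (95% CI {lower:.2f}-{upper:.2f})")
```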

Initially, heterogeneity of studies was examined visually using the data extraction tables. Then, statistical analysis was performed using the inconsistency value (I²) and the chi-squared Q statistic, for which an I² value > 50% or a p value < 0.10, respectively, represents significant statistical heterogeneity. In these cases, a random-effects model was applied to data pooling. Pooled results for sensitivity, specificity, and diagnostic odds ratio (DOR) with 95% CIs, and a summary receiver-operating characteristic (sROC) curve were also presented.
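For readers unfamiliar with these statistics, the following is a minimal sketch of Cochran's Q, the inconsistency value, and a random-effects pooled estimate, applied to log diagnostic odds ratios. It assumes the generic inverse-variance formulation with a DerSimonian-Laird estimate of between-study variance and a 0.5 continuity correction; these are illustrative choices, and the software used in this review may compute the quantities differently, particularly for pooled proportions. All function names are hypothetical.

```python
import math


def log_dor(tp, fp, fn, tn, correction=0.5):
    """Log diagnostic odds ratio and its variance for one 2 x 2 table.

    A continuity correction is added to every cell (an assumption made here
    so that tables containing a zero cell remain usable).
    """
    tp, fp, fn, tn = (cell + correction for cell in (tp, fp, fn, tn))
    estimate = math.log((tp * tn) / (fp * fn))
    variance = 1.0 / tp + 1.0 / fp + 1.0 / fn + 1.0 / tn
    return estimate, variance


def heterogeneity(estimates, variances):
    """Return Cochran's Q, its degrees of freedom, and the inconsistency value in percent."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, estimates))
    df = len(estimates) - 1
    i_squared = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return q, df, i_squared


def dersimonian_laird(estimates, variances):
    """Random-effects pooled estimate, its standard error, and tau squared."""
    weights = [1.0 / v for v in variances]
    q, df, _ = heterogeneity(estimates, variances)
    c = sum(weights) - sum(w ** 2 for w in weights) / sum(weights)
    tau_squared = max(0.0, (q - df) / c) if c > 0 else 0.0
    re_weights = [1.0 / (v + tau_squared) for v in variances]
    pooled = sum(w * y for w, y in zip(re_weights, estimates)) / sum(re_weights)
    standard_error = math.sqrt(1.0 / sum(re_weights))
    return pooled, standard_error, tau_squared
```

A pooled DOR and its 95% CI would then be obtained by exponentiating the pooled log estimate and the bounds `pooled ± 1.96 * standard_error`.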

To explore predictable sources of heterogeneity between the included studies, sensitivity and 1-specificity were plotted on an ROC plane to visually assess the presence or absence of a ‘shoulder arm’ shape, which indicates a threshold effect. This was also tested statistically with the Spearman correlation coefficient of the logit of sensitivity and logit of (1-specificity), with a p-value < 0.05 suggesting a threshold effect. Subgroup analysis was performed for b-values (< 1000, 1000 and > 1000 s/mm2), field strength (1.5T and 3T), coil type (endorectal and body), method of assessment (DWI source images, ADC or both), reference standard (biopsy and radical prostatectomy), tumor zone (peripheral or transitional zone) and study design (retrospective and prospective). Where possible, raw data were separated from individual papers for each subgroup. Pooled sensitivities, specificities, and positive and negative likelihood ratios were calculated, and meta-regression of diagnostic odds ratios was performed, for these subgroups, with a p-value < 0.05 deemed statistically significant.
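The threshold-effect statistic can be sketched as the Spearman rank correlation between logit(sensitivity) and logit(1 − specificity) across studies. The sketch below uses only the standard library, computes the coefficient alone (no p-value), and assumes every sensitivity and specificity lies strictly between 0 and 1 and that neither series is constant. The function names and the example values are hypothetical.

```python
import math


def logit(p):
    """Log-odds of a proportion strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def threshold_effect_rho(sensitivities, specificities):
    """Spearman correlation of logit(sensitivity) against logit(1 - specificity)."""
    x = [logit(s) for s in sensitivities]
    y = [logit(1.0 - s) for s in specificities]
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical studies in which sensitivity rises as specificity falls.
rho = threshold_effect_rho([0.6, 0.7, 0.8, 0.9], [0.9, 0.8, 0.7, 0.6])
```

A strongly positive coefficient means that studies with higher sensitivity also tend to have higher false positive rates, which is the pattern expected when studies differ mainly in their positivity threshold.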

Publication bias was not assessed as there is currently no recognized or appropriate method that does so with sufficient power for diagnostic accuracy studies, and the impact of publication bias is presently unknown for studies of this type [16].

All statistical analysis was performed using Meta-DiSc (version 1.4, Javier Zamora).

Results

Search results

With the search strategy presented above, 2825 citations were identified; after duplicates were removed, 1880 unique articles remained. A total of 33 studies were included in the final analysis after reviewing against the eligibility criteria. The PRISMA flowchart of the search results is presented in Fig. 1.

Fig. 1 PRISMA flow diagram. ADC, apparent diffusion coefficient; DWI, diffusion-weighted imaging; T2WI, T2-weighted imaging

Quality assessment

The full results of the QUADAS-2 appraisal are presented in Table 2. The strengths across the included studies were that the vast majority used consecutive patient selection with appropriate inclusion and exclusion criteria. However, two studies [17, 18] limited their investigation to transitional zone tumors and another [19] to patients with ‘low risk’ cancer. Therefore a subgroup analysis was deemed particularly important to assess the differences between peripheral and transitional zone tumors. Another strength was that all index tests used were applicable to clinical practice, without any nonstandard imaging methods. All but one study imaged patients after a positive biopsy, while patients studied by Tanimoto et al. had a pre-biopsy MRI [20]. A number of studies did not state the interval between biopsy and MRI [21,22,23,24]. This matters because too long an interval can cause a disparity between the images and the histopathology, while too short an interval increases the incidence of post-biopsy hemorrhage, which might limit accuracy. Kitajima et al. [25] and Morgan et al. [26] reported delays between biopsy and imaging much less than the recommended six weeks [27]. The predominant weakness of included studies was applicability of the patient groups, as studies were often limited to patients who underwent radical prostatectomy. These patients tend to be younger, with a narrower range of tumor staging. However, this is acceptable to obtain a reference test with low bias.

Table 2 QUADAS-2 quality assessment of included studies

Study characteristics

The data extracted for study characteristics are described in Tables 3, 4, and 5. There were 2949 patients across the 33 studies. The mean age (range) was 65.1 (41–86) years, and the mean PSA (range) was 9 (0.4–130) ng/mL. The majority of studies (n = 20) used a retrospective study design as opposed to prospective (n = 13). Most of the studies (n = 19) used 3T field strength, thirteen studies used 1.5T, and one study used both. Maximum b-values across the studies ranged from 600 to 2000 s/mm2, with the majority using 1000 s/mm2. Nine studies used an endorectal coil. Nine studies used DWI source images for diagnosis, while seven used ADC maps and seventeen used both. Most studies (n = 20) used radical prostatectomy as the reference standard, while seven used TRUS biopsy, two MRI-guided biopsy, one transperineal biopsy and another used a mixture of TRUS biopsy and radical prostatectomy.

Table 3 Principal characteristics of included studies
Table 4 Imaging and methodological characteristics of included studies
Table 5 Diagnostic performance of included studies

Meta-analysis

Visual assessment of the data extraction tables indicated they were homogeneous enough to undertake a meta-analysis with pooling. The pooled sensitivity (Fig. 2) and specificity (Fig. 3) of all included studies were 0.69 (95% CI 0.68–0.69) and 0.84 (95% CI 0.83–0.85), respectively. The pooled DOR was 12.27 (95% CI 9.60–15.68). The sROC (Fig. 4) gave an AUC of 0.839, indicating good diagnostic accuracy.

Fig. 2 Forest plot of sensitivity for detecting prostate cancer including 95% CI, I² value, and Q statistic. CI, confidence interval; I², inconsistency value

Fig. 3 Forest plot of specificity for detecting prostate cancer including 95% CI, I² value, and Q statistic. CI, confidence interval; I², inconsistency value

Fig. 4 Summary receiver-operating characteristic (sROC) curve for the detection of prostate cancer. AUC, area under the curve

The I² value and chi-squared Q were 94.6% and 882.53 (p < 0.001), respectively, for sensitivity and 96.7% and 1446.59 (p < 0.001) for specificity, indicating significant statistical heterogeneity. The ROC plane (Supplementary Fig. 1) did not show a ‘shoulder-arm’ shape; however, the Spearman rank coefficient of the logit of sensitivity against logit of (1-specificity) was 0.335 (p = 0.018), indicating there could be heterogeneity due to a threshold effect.

Sub-group analysis

The highest DORs were obtained when using ADC maps with or without DWI for tumor assessment and for b-values > 1000 s/mm2. Significantly higher sensitivity was achieved using b-values > 1000 s/mm2, 3T field strength, assessing peripheral zone (PZ) tumors, studies with a retrospective design and those using biopsy as a reference standard. Specificity improved significantly with a 1.5T field strength, assessing transitional zone (TZ) tumors, using ADC maps with or without DWI and those studies using radical prostatectomy as the reference standard. The complete subgroup analysis is shown in Table 6.

Table 6 Subgroup analysis and meta-regression

Discussion

The findings from this study show the diagnostic accuracy of DWI and T2WI of prostate cancer is good when using visual assessment. The greatest diagnostic accuracy is achieved with b-values > 1000 s/mm2, and when assessing lesions with both DWI source images and ADC maps, although the interplay between sensitivity and specificity can be significantly altered by the choice of field strength and by whether tumors originate from the peripheral or transitional zone. The overall strength of the evidence on which this analysis was based was graded as good by the QUADAS-2 critical appraisal tool [15]. However, there was a high degree of unknown statistical heterogeneity, so care should be taken when interpreting these results, and even though this review cannot specify an optimal imaging protocol, it does highlight the likely important factors to be considered.

Our pooled results match those of meta-analyses investigating T2WI and DWI by Wu et al. and Tan et al.; this is likely due to the large overlap of included studies [28, 29]. Compared with Godley et al. and Jie et al. who analyzed the use of DWI alone, we observed a higher sensitivity but lower specificity [5, 30]. However, when we compare the results for just peripheral zone tumors, our pooled results are similar. This would suggest that the addition of T2WI improves the sensitivity for diagnosing transitional zone tumors; however, neither Godley nor Jie et al. presented a subgroup for TZ tumors or comparison [10]. This finding supports the present consensus that T2WI with DWI should be the predominant imaging protocol for diagnosing TZ tumors [9].

We observed a significant increase in sensitivity using a maximum b-value > 1000 s/mm2, and improved specificity with a maximum b-value of ≥ 1000 s/mm2. The improved contrast-to-noise ratio at higher b-values, resulting from the relative suppression of normal prostate tissue, would explain the increase in sensitivity by making tumors more visually apparent. Two of the studies [23, 24] also used computed high b-values. These synthetic data extrapolated from low b-value datasets showed relatively decreased sensitivity and increased specificity compared to the equivalent acquired b-values. There has been limited research comparing the diagnostic accuracy of computed DWI to standard DWI, but the method shows promise with reduced distortion and ghosting and improved tumor conspicuity [31, 32].
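The idea behind a computed high b-value image can be sketched as follows: an ADC is estimated from two acquired lower b-values and the signal is then extrapolated to the target b-value. This assumes mono-exponential decay, S(b) = S0 · exp(−b · ADC); the function name is hypothetical, and the methods used in the cited studies or in vendor software may differ.

```python
import math


def computed_signal(s_low, s_high, b_low, b_high, b_target):
    """Extrapolate the signal at b_target from two acquired b-values.

    Assumes mono-exponential decay: the ADC is estimated from the acquired
    pair and the lower-b signal is then attenuated to the target b-value.
    """
    if s_low <= 0 or s_high <= 0:
        raise ValueError("signal intensities must be positive")
    if b_high <= b_low:
        raise ValueError("b_high must be greater than b_low")
    adc = math.log(s_low / s_high) / (b_high - b_low)
    return s_low * math.exp(-(b_target - b_low) * adc)
```

Because the extrapolation relies on the fitted ADC, noise or model error in the acquired images carries over into the computed image, which is one reason computed and acquired high b-value images need not perform identically.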

We also note that all studies using b-values > 1000 s/mm2 were limited to a maximum b-value of 2000 s/mm2, except the study by Kuhl et al. [33]. Wang et al. and Metens et al. found b-values of 1500 s/mm2 gave a better tumor contrast and image quality than b-values of 1000 or 2000 s/mm2, and Kuhl et al., using a b-value of 1400 s/mm2, produced some of the highest sensitivities and specificities [33,34,35]. However, more data on the diagnostic accuracy of DWI at b ≈ 1500 s/mm2 are required. Furthermore, the maximum b-value, the minimum b-value, and the number of b-values have all been shown to have a strong influence on the calculated ADC values [36]. However, there is little evidence about their impact on diagnostic accuracy with visual assessment [10, 36].

All but two of the included studies in this analysis used b = 0 s/mm2 as the minimum b-value, but the number of b-values ranged from two to seven. Thörmer et al. found that using just two b-values and a minimum b-value of 50 s/mm2 gave an improved qualitative image score versus data with a minimum b-value of 0 s/mm2 [37]. However, they tested only a limited number of combinations, and used a maximum b-value of just 800 s/mm2. The significant heterogeneity of b-value choice in the included studies makes it extremely difficult to conclude that high b-values are indeed superior for diagnostic accuracy. The individual studies that tested multiple b-value sets on the same cohort do, however, show improved diagnostic accuracy using b = 2000 s/mm2 as opposed to 1000 s/mm2 or lower. Further studies directly comparing b-value sets with different maximum and minimum b-values and different numbers of intermediate b-values would be required to make a stronger recommendation of b-value choice.

DOR was not significantly different between 1.5T and 3T studies (p = 0.418), but 3T studies showed a significantly higher sensitivity and significantly lower specificity than those performed at 1.5T. Higher field strengths have the advantage of increased SNR, which can be traded for better spatial and temporal resolutions, but they also lead to increased susceptibility artifact and signal heterogeneity, and there is conflicting evidence with respect to the categorical advantage of 3T over 1.5T [38]. There is a trend toward better diagnostic accuracy with 3T in our study, although this may be because these systems allow the use of higher b-values, which improve diagnostic accuracy. This result reflects the recommendations of PI-RADS v2 that 1.5T and 3T are both adequate, but 3T is regarded as optimal if available [9].

For a few of the studies, it was possible to separate the results for PZ and TZ, and we found significantly higher sensitivity for the PZ, but higher specificity for the TZ. TZ tumors are often of a lower grade than those found in the PZ, so they may be less apparent on imaging [39, 40]. There is also difficulty in differentiating malignancy from the benign nodules common in the TZ, which are often heterogeneous and can demonstrate restricted diffusion. Along with the relative rarity of TZ tumors, this may explain the drop in sensitivity, although the overall DOR was not significantly different. It may be that different imaging parameters are needed for optimal diagnosis of peripheral or transitional zone disease.

Our results showed a significant increase in both sensitivity and specificity when using ADC maps with or without DWI source images for diagnostic assessment, as opposed to using DWI source images alone. There are many advantages to using ADC maps which might explain this change. Firstly, ADC maps give a quantitative measure of tissue diffusion, and are particularly useful in differentiating areas which have high signal on DWI images due to T2 shine-through, such as post-biopsy hemorrhage; this leads to reduced false positives and improved specificity versus weighted images. The ADC value can also be used to help confirm malignant lesions, which have low ADCs due to restricted diffusion, and this would explain the higher sensitivity seen.

Retrospective studies investigated men with previously confirmed prostate cancer, and therefore the readers knew there was cancer present in each prostate examined. This may cause the readers to be more liberal with diagnosing suspicious lesions in borderline cases where there were no other lesions in the gland, explaining the significantly higher sensitivity.

Using radical prostatectomy as the reference standard allows the assessment of individual tumors within the gland and is a more accurate method of defining tumor. TRUS biopsy is ‘blind’ and only samples a small area of the prostate, with a 20–30% false negative rate. This would lead to increased false positives on imaging, decreasing the specificity as we observe in the subgroup analysis.
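The effect of an imperfect reference standard on measured specificity can be illustrated with a short calculation. The sketch below is a simplified model rather than an analysis from this review: it assumes the reference test never produces false positives, and that imaging and reference errors are independent given the true disease status. The function name and the numbers in the example are hypothetical, apart from the 25% false negative rate, which lies within the 20–30% range quoted above.

```python
def apparent_accuracy(prevalence, mri_sens, mri_spec, ref_fnr):
    """Sensitivity and specificity of imaging as measured against an imperfect reference.

    Simplifying assumptions (illustrative only): the reference standard has no
    false positives, and imaging and reference errors are independent given
    the true disease status.
    """
    for value in (prevalence, mri_sens, mri_spec, ref_fnr):
        if not 0.0 <= value <= 1.0:
            raise ValueError("all inputs must be proportions between 0 and 1")
    healthy = 1.0 - prevalence                # truly cancer-free
    missed = prevalence * ref_fnr             # true cancers the reference calls negative
    ref_negative = healthy + missed
    if ref_negative == 0:
        raise ValueError("no reference-negative cases under these inputs")
    # Imaging-negative cases among all reference-negative cases.
    both_negative = healthy * mri_spec + missed * (1.0 - mri_sens)
    apparent_spec = both_negative / ref_negative
    # Every reference-positive case is a true cancer, so sensitivity is unchanged.
    apparent_sens = mri_sens
    return apparent_sens, apparent_spec


# Hypothetical inputs: 50% prevalence, true sensitivity 0.70, true specificity 0.85.
sens, spec = apparent_accuracy(prevalence=0.5, mri_sens=0.70, mri_spec=0.85, ref_fnr=0.25)
print(f"apparent sensitivity {sens:.2f}, apparent specificity {spec:.2f}")
```

Under these assumptions, cancers missed by the reference but detected on imaging are counted as imaging false positives, so the measured specificity falls below the true value while the measured sensitivity is unaffected.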

This systematic review has a few limitations. First, our search was limited to a finite number of databases, although those chosen contain the majority of the relevant journals, and by exploring the gray literature and hand-searching references, we believe the search strategy was of sufficient sensitivity. Specific databases for the research question were sought, but none existed. Second, the search was limited to the English language. The majority of articles are published in English, but there may be data in other languages that we did not include in this meta-analysis. Third, we did not assess for publication bias, for the reasons stated in the statistical analysis section; the degree to which publication bias impacts diagnostic tests is unknown [16]. Fourth, we did not review the exact T2WI parameters for the included studies, which could explain some of the heterogeneity seen. Reader experience is another factor which we did not assess, as it was often poorly reported and in different formats such as years practicing, years reporting prostate mpMRI, or number of prostate mpMRIs. It is recognized that reader experience is important in interpreting mpMRI and should be considered when implementing prostate imaging [41, 42]. Finally, although diagnostic accuracy is very important for prostate cancer assessment, there are other aims of mpMRI which have not been assessed in this meta-analysis: for example, assessment of extracapsular extension, seminal vesicle or lymph node involvement, and the ability of mpMRI to quantify tumor size and volume. These findings are all used in staging of disease and are relevant to decisions about optimal imaging sequences.

In conclusion, the diagnostic accuracy of combined diffusion- and T2-weighted magnetic resonance imaging for prostate cancer detection is good, and our results support the PI-RADS v2 guidelines [9]. The use of b-values > 1000 s/mm2 seems to improve the sensitivity while maintaining specificity. However, due to large amounts of heterogeneity, we cannot categorically recommend using maximum b-values up to 2000 s/mm2 for all DWI protocols for prostate cancer assessment. Further large-scale studies investigating the optimal maximum b-value, minimum b-value, and number of b-values for the visual assessment of prostate cancer are required.