Introduction

Multi-parametric magnetic resonance imaging (mpMRI) is emerging as an important modality in prostate cancer diagnosis [1–3]. Several studies have shown that in patients with initial negative trans-rectal ultrasound-guided biopsies (TRUSGB) and persistently elevated prostate-specific antigen (PSA), expert readers using mpMRI find cancer in 38–59 % of the cases [4, 5]. Furthermore, it has been shown that mpMRI correctly upgrades TRUSGB-detected cancers in up to 30 % of cases [6]. Several other studies found that the negative predictive value of mpMRI is high enough to avoid TRUSGB in 30–50 % of men with persistently elevated PSA [7, 8]. However, one of the main obstacles to broader acceptance of mpMRI is the lack of the required expertise, especially in the acquisition and interpretation of the MR images [1, 9, 10].

To improve the acquisition and interpretation of mpMRI, the European Society for Urogenital Radiology (ESUR) established initial guidelines for acquisition and standardized interpretation of mpMRI (PIRADS) [1]. These guidelines have been evaluated by several groups, for detection of cancer both prior to biopsy [8, 11] and after initial negative TRUSGB [12–15]. There are, however, still two major issues in current prostate MRI: determining which cancers need treatment (assessment of aggressiveness) and the large number of false positives resulting in unnecessary biopsies. Computer-aided diagnosis (CAD) might be able to address these problems.

The purpose of this study is twofold. First, we provide a clinical investigation of the effect of CAD [16] on the diagnostic accuracy of prostate mpMRI reporting via independent combination of PIRADS scores and CAD prediction into a radiologist/CAD combination score. The performance of this combination score is evaluated in a comparatively large cohort of patients with MR-guided MR-biopsy histopathological outcome as reference standard. Second, we investigate the ability of CAD to estimate prostate cancer aggressiveness.

Materials and methods

Patient data

An institutional review board (IRB) waiver applies to this study as it uses anonymized imaging data and MR-guided biopsy results obtained through regular clinical care. In total 130 consecutive patients from 1 January to 1 September 2013 who received both an mpMRI and a subsequent MR-guided MR-biopsy at our institution were included. The inclusion criteria for the detection mpMRI were an initial negative TRUSGB and persistently elevated PSA (consistently above 4 ng/mL).

Multi-parametric MRIs were acquired according to the ESUR guidelines and included T2-weighted imaging in three orthogonal directions, diffusion-weighted imaging and dynamic contrast-enhanced imaging. All MRIs were performed at a Siemens 3 T MRI scanner (TRIOTIM or Skyra) without an endo-rectal coil. Full acquisition details are presented in Table 1.

Table 1 MRI sequence details for the different types of acquisitions

Each mpMRI was read prospectively, as part of regular clinical care, by one of a group of seven radiologists who reported prostate MRI in our clinic. Experience levels of the reporting radiologists ranged from moderately experienced (2 years) to very experienced (J.B., 20 years). Details on the number of cases read by each radiologist are presented in Table 2. The ESUR prostate imaging reporting and data system (PIRADS) classification was used to assign a five-point PIRADS score to one or more lesions.

Table 2 Overview of the radiologists reading cases in the study cohort, including number of cases read (out of 107 included studies) and years of experience

Each MR study was reported using a dedicated prostate MR workstation that allowed radiologists to indicate one or more areas of suspicion with a sphere enclosing the lesions. If no suspicious areas could be identified, a location was marked which was deemed normal/benign and assigned a PIRADS score of 1 or 2. This is performed in routine clinical care for accountability, to ensure that each case has been read and reported. Typically, PIRADS 1 or 2 lesions do not get biopsied. Sometimes a PIRADS 2 lesion was also biopsied, when a PIRADS 3 or higher lesion was also identified and the patient was thus already scheduled for MR-guided MR biopsy. The locations and scores were automatically recorded in a database.

MR-guided biopsies were performed by medical experts with multiple years of experience in MR-guided prostate biopsies. At the start of the biopsy procedure a T2-weighted volume and an ADC map were acquired according to the prostate cancer detection protocol (Table 1). These sequences were used to relocate the lesions identified in the prior detection MRI. After the lesions had been relocated, a needle guide was inserted transrectally. Consecutive sagittal and transversal MRIs were made during repositioning of the needle guide to assess whether the correct position had been reached. Once the correct position had been reached, a biopsy needle was inserted and a biopsy taken. To verify the biopsy location, sagittal and transversal images were made with the needle in situ. Subsequently, biopsies were histopathologically processed, inspected and graded by an experienced uropathologist (17 years of experience in prostate pathology).

Computer-aided diagnosis system

The computer-aided diagnosis system evaluated in this paper was previously presented in [16]. First, the system computes quantitative voxel features, which were designed to capture characteristics described by the PIRADS guidelines. A full feature listing can be found in Table 3. These voxel features are then fed to a random forest classifier trained to assign each voxel a continuous cancer likelihood score, resulting in a likelihood image. Subsequently, in a second stage, the system uses the centre of the sphere indicated by the radiologist as a starting point for lesion segmentation, which is performed on the pre-computed likelihood image. After lesion segmentation, histogram statistics are calculated on the voxel features within the lesion (e.g., percentiles, mean, standard deviation). In addition, local contrast is calculated by comparing the voxel feature values within the lesion to values outside the lesion. Symmetry is calculated by comparing the feature values within the lesion to the feature values at the same relative position on the contra-lateral side of the prostate. The statistical, local contrast and symmetry features are then combined using a second random forest classifier trained to predict cancer likelihood per lesion. The system is able to take into account the zonal location of the lesion by using a probabilistic segmentation of the prostate zones as one of its features (Table 3). The construction of this probabilistic segmentation is detailed in [17].

Table 3 Descriptions of the voxel features used in the computer-aided diagnosis system

The CAD system was trained with independent, retrospective patient data (237 patients), which had no overlap with the data set used in this study. The retrospective data was acquired in a similar manner (same MRI protocol) as the evaluation data and also had MR-guided biopsy as the reference standard.

Combination of PIRADS score and computer-aided diagnosis (CAD) likelihood

The use of the system as proposed in this paper is presented schematically in Fig. 1. The initial identification of potential suspicious regions was performed by the radiologist, after which the radiologist and the CAD system gave independent scores on whether clinically significant cancer was present [1]. The radiologist did this by assigning a five-point PIRADS score, while the CAD system assigned a continuous likelihood score between 0 and 1.

Fig. 1

Suggested workflow for the proposed computer-aided diagnosis (CAD) system. The biopsy decision can be made by the radiologist, another attending clinician or by using the combination score to independently combine the PIRADS score and the CAD likelihood

The reported scores of the radiologists (PIRADS) and CAD (likelihoods) were combined into a combination score via logistic regression, which is a technique to map multiple variables to one single, continuous outcome variable (between 0.0 and 1.0) in an independent manner. The regression model was created with SPSS (version 20.0.01, Chicago, IL, USA). The logistic regression was performed on the retrospective data that was also used to train the CAD system to ensure no bias would occur by training and testing on the same data. It was thus independent of the evaluation data used in this paper. Alternative methods of incorporating CAD results may be used in clinical workflow, such as asking a second radiologist to make a final decision based on the two scores, but these were not investigated in this paper.

Statistical evaluation

Radiologist-identified lesions were categorized into either benign or cancer based on the MR-guided MR biopsy outcome. Cancerous lesions were further subdivided into low-grade, intermediate-grade or high-grade cancer based on the MR-guided biopsy Gleason scores, similar to Vos et al. [18] and Hambrock et al. [19]. Our high-sensitivity MR-guided biopsy strategy has been shown to have a concordance of 95 % with prostatectomy Gleason grade [20].

We used two different settings for evaluation in this study: either benign versus cancerous or indolent versus aggressive lesions. In the latter case the benign and low-grade lesions are considered indolent and intermediate- and high-grade lesions are considered aggressive. These settings are summarized in Table 4. The CAD system and logistic regression model were constructed separately for each setting using the retrospective data.
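The two evaluation settings amount to two different binarizations of the same grade category. The following is a minimal sketch of that labelling, assuming each lesion has already been assigned one of the four categories named above; the function name, setting names and category strings are illustrative and not part of the study software, and the Gleason-to-grade mapping of Table 4 is not reproduced here.

```python
# Grade categories as named in the text. The mapping from Gleason score to
# these categories is given in Table 4 and is deliberately not encoded here.
INDOLENT = {"benign", "low"}
AGGRESSIVE = {"intermediate", "high"}


def binary_label(grade: str, setting: str) -> int:
    """Return 1 if the lesion is in the positive class of the given setting.

    setting is a hypothetical identifier: "benign_vs_cancer" or
    "indolent_vs_aggressive".
    """
    if grade not in INDOLENT | AGGRESSIVE:
        raise ValueError(f"unknown grade category: {grade!r}")
    if setting == "benign_vs_cancer":
        # Any cancer, regardless of grade, is positive.
        return int(grade != "benign")
    if setting == "indolent_vs_aggressive":
        # Benign and low-grade lesions are pooled as indolent (negative).
        return int(grade in AGGRESSIVE)
    raise ValueError(f"unknown setting: {setting!r}")
```

A low-grade lesion is therefore positive in the first setting and negative in the second, which is why the CAD system and regression model were constructed separately for each setting.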

Table 4 Mapping of Gleason scores to cancer grade

The statistical evaluation consisted of three parts. First, we investigated the hypothesized increase in predictive power of the combination score over the radiologist PIRADS score alone using the likelihood ratio test on the logistic regression models.
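For readers unfamiliar with the likelihood ratio test, the sketch below shows the computation for two nested logistic regression models that differ by a single parameter (here, the CAD likelihood added to a PIRADS-only model). It assumes the maximized log-likelihoods of both models are already available from the fitting software; the function name and the example values are hypothetical, and this is not the SPSS procedure used in the study.

```python
import math


def likelihood_ratio_test(loglik_reduced: float, loglik_full: float):
    """Likelihood ratio test for nested models differing by one parameter.

    loglik_reduced: maximized log-likelihood of the smaller model
                    (e.g., PIRADS only).
    loglik_full:    maximized log-likelihood of the larger model
                    (e.g., PIRADS plus CAD likelihood).
    Returns (test statistic, p-value) with 1 degree of freedom.
    """
    # The full model can never fit worse than the reduced one; clamp at 0
    # to guard against tiny negative values from numerical noise.
    stat = max(2.0 * (loglik_full - loglik_reduced), 0.0)
    # For 1 degree of freedom the chi-square survival function has the
    # closed form P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical log-likelihoods, for illustration only.
stat, p = likelihood_ratio_test(-100.0, -92.0)
print(f"LR statistic = {stat:.2f}, p = {p:.2g}")
```

A small p-value indicates that adding the extra predictor improves the fit more than would be expected by chance.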

Second, the diagnostic performance of the CAD system, the radiologist PIRADS score and the combination score was evaluated using receiver-operating characteristic (ROC) analysis for both evaluation strategies. The significance of improvement for area under the ROC curve, and different sensitivity-specificity pairs at the different PIRADS thresholds, was tested using bootstrapping. A total of 10,000 bootstrap samples was used to obtain the 95 % confidence intervals (CIs). Bootstrapping was stratified according to patient to circumvent bias introduced by multiple lesions per patient.
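The patient-level bootstrap can be sketched as follows. This is an illustration under stated assumptions rather than the in-house tool used in the study: it interprets "stratified according to patient" as resampling whole patients with replacement so that all lesions of a patient stay together, it uses a percentile confidence interval, and the function names and data layout are hypothetical.

```python
import random


def auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney statistic.

    Ties between a positive and a negative score count as 0.5.
    Returns None if one of the two classes is absent.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_difference(lesions, n_boot=10000, seed=0):
    """95 % percentile CI for AUC(score_b) - AUC(score_a).

    lesions: list of (patient_id, label, score_a, score_b) tuples,
             e.g. score_a = PIRADS and score_b = combination score.
    Patients, not lesions, are resampled with replacement.
    """
    rng = random.Random(seed)
    by_patient = {}
    for pid, label, a, b in lesions:
        by_patient.setdefault(pid, []).append((label, a, b))
    ids = list(by_patient)

    diffs = []
    for _ in range(n_boot):
        drawn = rng.choices(ids, k=len(ids))
        sample = [row for pid in drawn for row in by_patient[pid]]
        labels = [row[0] for row in sample]
        auc_a = auc([row[1] for row in sample], labels)
        auc_b = auc([row[2] for row in sample], labels)
        if auc_a is None:
            continue  # resample contained only one class; skip it
        diffs.append(auc_b - auc_a)

    diffs.sort()
    lower = diffs[int(0.025 * (len(diffs) - 1))]
    upper = diffs[int(0.975 * (len(diffs) - 1))]
    return lower, upper
```

If the resulting interval excludes zero, the difference in area under the ROC curve would be considered significant at the 0.05 level under this scheme. The same resampling loop can be reused for sensitivity and specificity at a fixed threshold.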

To assess the effect of the zonal location of the lesions on the performance of the radiologist, the CAD system and the combination score, the dataset was split into two sets, one containing only central gland lesions and one containing only peripheral zone lesions.

The effect of observer experience on the performance of the combination score was also assessed. The dataset was split into two groups, one containing the cases reported by the experienced radiologists (more than 5 years) and one containing the cases reported by the less experienced radiologists (less than 5 years, but more than 2 years).

Third, we correlated radiologist PIRADS, CAD score and the combined score to cancer grade. As cancer grade is an ordinal variable, Spearman’s rank correlation coefficient was used. The significance of differences in correlation coefficients was tested using Steiger’s z-test for dependent correlation coefficients [21]. For all significance tests a p-value threshold of 0.05 was chosen. SPSS (SPSS, version 20.0.01) and in-house developed tools for bootstrapping were used for all statistical analysis.
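As an illustration of the rank correlation used here, the sketch below computes Spearman's coefficient as the Pearson correlation of average ranks, which handles the many ties that occur with a five-point score and an ordinal grade. It is a plain Python sketch with hypothetical function names, not the SPSS routine used in the study, and it does not implement Steiger's z-test.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Cancer grade would be encoded as an ordered integer (for example 0 for benign up to 3 for high-grade) before being passed to `spearman` together with the PIRADS, CAD or combination scores; that encoding is an assumption of this sketch.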

Results

Patient data

Of the initially included 130 patients, 23 were excluded: 18 due to previous treatment for prostate cancer, two for failed diffusion-weighted imaging, two because they did not undergo dynamic contrast-enhanced imaging and one because no biopsy was taken during the biopsy session. The 107 included patients had a median age and PSA level of 66 years (range 48–83) and 13 ng/mL (range 1–56), respectively, which is similar to other studies using patient data with the same inclusion criteria (initial negative TRUSGB and persistently elevated PSA) [12–15]. Further details are summarized in Fig. 2 and Table 5.

Fig. 2

STARD diagram of inclusion and exclusion criteria of the prospective patient cohort

Table 5 Characteristics of patients and biopsy specimens for the prospective cohort used to evaluate the potential added value of a computer-aided diagnosis system for the assessment of prostate cancer. For each group of lesions the numbers between brackets indicate the number of lesions in the peripheral zone and the central gland, respectively

In total 141 suspicious regions were identified in these patients. All these regions were biopsied under MR-guidance. Of these regions, 68 % were positive and 32 % were negative for prostate cancer. The zonal distribution of the lesions was almost equal, with 69 regions being located in the peripheral zone and 72 in the central gland. Gleason grades were 2 + 5 (1 %), 3 + 2 (2 %), 3 + 3 (27 %), 3 + 4 (38 %), 4 + 3 (13 %), 4 + 4 (5 %), 4 + 5 (10 %), 5 + 4 (3 %) and 5 + 5 (1 %). Further details about the distribution of the lesion grades can be found in Table 5.

The effect of CAD on radiologist performance

First, the logistic regression procedure showed that including the CAD system likelihood in addition to the radiologist PIRADS score resulted in a model with significantly improved predictive power (p < 0.001, likelihood ratio test) for both evaluation settings (benign vs. cancer and indolent vs. aggressive). Using the obtained regression coefficients we created a weighted combination score:

$$ \mathrm{Combination}\ \mathrm{score}=4.739\;C+1.06\;P $$

for the benign versus cancer setting and

$$ \mathrm{Combination}\ \mathrm{score}=3.886\;C+1.295\;P $$

for the indolent versus aggressive setting. C is the CAD system likelihood (ranging from 0 to 1) and P is the radiologist PIRADS score (ranging from 1 to 5). The regression models are visually represented in Fig. 3.
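The two weighted sums above can be evaluated directly, as in the sketch below. It returns the linear score exactly as printed; because no intercept is reported, it does not apply a logistic transform to map the score to a probability. The function name, setting identifiers and input checks are illustrative assumptions.

```python
# Regression coefficients (CAD weight, PIRADS weight) as reported in the text.
COEFFICIENTS = {
    "benign_vs_cancer": (4.739, 1.06),
    "indolent_vs_aggressive": (3.886, 1.295),
}


def combination_score(cad_likelihood: float, pirads: int, setting: str) -> float:
    """Weighted combination of CAD likelihood C and PIRADS score P.

    cad_likelihood: C, in the range 0 to 1.
    pirads:         P, an integer from 1 to 5.
    setting:        hypothetical key, "benign_vs_cancer" or
                    "indolent_vs_aggressive".
    """
    if not 0.0 <= cad_likelihood <= 1.0:
        raise ValueError("CAD likelihood must lie between 0 and 1")
    if pirads not in (1, 2, 3, 4, 5):
        raise ValueError("PIRADS score must be an integer from 1 to 5")
    w_cad, w_pirads = COEFFICIENTS[setting]
    return w_cad * cad_likelihood + w_pirads * pirads


# Example: a PIRADS 4 lesion with a CAD likelihood of 0.5.
print(combination_score(0.5, 4, "benign_vs_cancer"))
print(combination_score(0.5, 4, "indolent_vs_aggressive"))
```

Since C spans 0 to 1 and P spans 1 to 5, a full-range change in CAD likelihood shifts the score by about as much as three to four and a half PIRADS categories, depending on the setting.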

Fig. 3

Visual depictions of the regression models used to generate the combination score of the radiologist and the computer-aided diagnosis (CAD) system: (a) shows the model for the benign vs. cancer setting, (b) for the indolent vs. aggressive setting. The likelihood of cancer is indicated by the colour coding and the contour labels and ranges from 0 to 1. Green indicates low likelihood and red indicates high likelihood

The ROC analyses showed a significant increase (p < 0.05) in area under the ROC curve from 0.81 to 0.88 in the benign versus cancer setting and from 0.78 to 0.87 in the indolent versus aggressive setting when using the combination score versus only PIRADS (Fig. 4a, b and Table 6). Furthermore, this increase was not affected by the zonal location of the lesion under investigation (Fig. 4c, d, e and f and Table 7). The ROC analysis also shows that radiologists and CAD have a comparable diagnostic accuracy. Table 6 also lists the increases in sensitivity and specificity obtained at specific PIRADS scores when using the combination score. For example, at PIRADS 4 without using CAD a sensitivity of 0.93 is attainable at a specificity of 0.37 in the indolent versus aggressive setting. However, when combining the PIRADS score with the CAD score we obtain a significantly increased sensitivity of 0.98 at a significantly increased specificity of 0.59 (p < 0.05). Finally, both less experienced and experienced readers can improve their performance by using CAD. In the indolent versus aggressive evaluation setting both groups improved significantly when using CAD, with an increase in AUC from 0.76 to 0.85 for less experienced and from 0.78 to 0.87 for experienced readers (Fig. 5, Table 8). In the benign versus cancer setting both groups also improved, but the improvement was significant only for the less experienced readers (AUC from 0.79 to 0.89).

Fig. 4

Receiver-operating characteristic (ROC) curves showing the performance of the computer-aided diagnosis (CAD) system (orange) and the radiologist/CAD-system combination (blue). The shaded areas indicate the 95 % confidence intervals (CIs) as calculated using bootstrapping. The radiologist performance is indicated with points for the different PIRADS thresholds. The vertical error bars indicate the 95 % CI on the sensitivity and the horizontal error bars indicate the 95 % CI on the specificity as estimated by bootstrapping. a, c and e are the results of the benign versus cancer evaluation setting; b, d and f are the results of the indolent versus aggressive setting. a and b show the results over all lesions, c and d only the peripheral zone lesions and e and f only the central gland lesions

Table 6 Sensitivity-specificity pairs and area under the receiver operating characteristic (ROC) curve for the radiologist and the computer-aided diagnosis (CAD)-radiologist combination including 95 % confidence intervals (CIs) and p-values determined by bootstrapping. Significant p-values (p < 0.05) are indicated in bold. The first part of the table contains the results for the benign vs. cancer evaluation setting whereas the second part of the table contains the results for the indolent vs. aggressive evaluation setting

Table 7 Receiver operating characteristic (ROC) analysis differentiated with respect to the zonal location of the lesions. Area under the ROC curve is reported for both the benign vs. cancer and indolent vs. aggressive evaluation settings. P-values measuring whether the increase in area under the ROC curve is significant when using computer-aided diagnosis (CAD) were calculated using bootstrapping. Significant p-values are indicated in bold

Fig. 5

Receiver-operating characteristic (ROC) curves showing the performance of the combined score vs. the radiologist alone with respect to the level of experience. The shaded areas indicate the 95 % confidence intervals (CIs) as calculated using bootstrapping. The radiologist performance is indicated with points for the different PIRADS thresholds. The vertical error bars indicate the 95 % CI on the sensitivity and the horizontal error bars indicate the 95 % CI on the specificity as estimated by bootstrapping. a is the result of the benign versus cancer evaluation setting, b is the result of the indolent vs. aggressive setting

Table 8 Receiver operating characteristic (ROC) analysis comparing inexperienced and experienced readers (less or more than 8 years of experience with prostate MRI) when using computer-aided diagnosis (CAD). Area under the ROC curve including 95 % confidence intervals (CIs) are reported for both the benign vs. cancer and indolent vs. aggressive evaluation settings. P-values measuring whether the increase in area under the ROC curve is significant when using CAD were calculated using bootstrapping. Significant p-values are indicated in bold

Correlation of likelihood and cancer grade

Both CAD likelihood and PIRADS score correlate significantly with cancer grade, but the combination score shows the strongest correlation. This is confirmed when assessing the correlation coefficients. In the benign versus cancer setting, correlation was 0.534, 0.582 and 0.684 for CAD, radiologist and combination, respectively. In the indolent versus aggressive setting the correlation coefficient was 0.536, 0.582 and 0.694 for CAD, radiologist and combination, respectively. The increase in correlation when using the combination score instead of just PIRADS or CAD was significant (p < 0.01). The ability of each of the three scores to predict aggressiveness is visualized in Fig. 6 for both evaluation settings.

Fig. 6

Relationship between computer-aided diagnosis (CAD) system likelihood and cancer grade presented by box-plots. One can observe a positive correlation between cancer grade and CAD system likelihood. a is the result of the benign vs. cancer evaluation setting, b is the result of the indolent vs. aggressive settings

Discussion

A recently developed CAD system for the computerized analysis of prostate MR was shown to have a similar diagnostic accuracy to well trained prostate MR radiologists. When combined with the PIRADS score into a combination score, diagnostic performance improved significantly. It is important to note that the CAD system was used in a regular clinical practice setting on a large cohort. This provides further evidence of the benefit of CAD in helping improve diagnostic accuracy.

The CAD system score showed a significant correlation (0.54) with cancer grade similar to the PIRADS score (0.58). A significantly higher correlation (0.69) was obtained by using the combination score. This correlation coefficient is the highest currently reported in literature [18, 22]. A noticeable difference with other multivariate aggressiveness correlation studies is that this study attains the correlation in a setting with radiologist-indicated regions instead of pathology pre-defined regions of interest, which is more similar to regular clinical practice.

Although the performance of the radiologist and that of the CAD system were similar, they provide complementary information, as the combination score results in an improved ROC curve (blue curve, Fig. 4a, b). In both evaluation settings (benign vs. cancer and indolent vs. aggressive) the area under the ROC curve increased significantly (0.81 to 0.88, p = 0.013 and 0.78 to 0.87, p = 0.001, respectively). If we compare these results to those found in the literature, we observe similar values for performance of readers with CAD; however, the PIRADS performance is somewhat lower in our study (0.84 to 0.87 in [23], 0.85 to 0.91 in [24]). We believe these differences to be caused by the difference in reading setting and the fact that we used a substantially larger and more difficult cohort. Our study used prospective clinical reading, whereas these previous studies used a retrospective batch reading setting. Furthermore, previous studies were limited to patients scheduled for radical prostatectomy, which differs from the regular clinical population used in this study.

Furthermore, we showed that identifying clinically significant disease using CAD can improve both experienced and less experienced readers (increase of 0.09 in AUC for both groups). However, when discriminating any cancer from benign lesions only less experienced readers improved significantly, indicating that CAD might especially be helpful in identifying clinically significant disease. Note that direct comparison between the performance of less experienced and experienced readers cannot be performed in this study as both groups reported on different patients. However, we do not expect results to change as both groups evaluated a sufficiently large and similar subset of cases.

The added value of CAD did not seem to depend on the zonal location of the lesions. The increase in performance when using CAD was similar for peripheral zone and central gland lesions. The CAD system performed equally well on peripheral zone and central gland lesions, indicating that it has successfully learned how to take into account the zonal characteristics of the lesions. CAD could possibly help shift the biopsy threshold from PIRADS 3 to PIRADS 4, which would lead to a significant reduction of MR-guided biopsies. At a PIRADS threshold of 3, CAD significantly increased the specificity in both evaluation settings (Table 6; 0.15 to 0.30, p = 0.020 and 0.09 to 0.26, p = 0.023). At the PIRADS 4 threshold a significant increase in sensitivity and specificity was found (0.93 to 0.98, p = 0.029, and 0.37 to 0.59, p = 0.013) in the indolent versus aggressive setting. The latter indicates that by using CAD and increasing the biopsy threshold from PIRADS 3 to 4 almost no loss in detection of aggressive lesions occurs (sensitivity 0.99 vs. 0.98), while a specificity improvement is obtained (0.59 vs. 0.26), reducing unnecessary biopsies. Of course this has to be investigated further in future clinical studies. One important aspect relevant to patient prognosis was not assessed by the CAD system: the presence or absence of extracapsular extension. Being able to identify this aspect would further enhance the applicability of the CAD system.

This study has some limitations. Firstly, MR-guided biopsy has a 95 % concordance with prostatectomy Gleason grade for Gleason 4 and 5 components [20, 25] and has been shown to be able to detect clinically significant cancer in men with previous negative TRUS biopsies [26]. Although in general concordance rates with prostatectomy Gleason grade are high, they are not perfect. Thus, some of the cancers in our study may be under- or overgraded. We expect the effect on our results to be minimal, as this only affects the indolent versus aggressive setting.

A second limitation is that each case in this study was read by one of seven radiologists. We know diagnostic accuracy is dependent on reader experience and thus our results depend on the average reader experience of the group. All our readers have had reasonable training and experience of at least 2 years [8, 11].

Third, due to the single reader per case and the prospective reading setting, inter- and intraobserver variability and false-negative rates could not be assessed. As such, this study does not obviate the need for retrospective observer studies, in which these aspects could be assessed, but provides a different, more clinically realistic view on the added value of CAD. Furthermore, due to our comparatively large number of cases we were still able to show significantly improved diagnostic performance when using CAD.

The proposed method of implementing CAD in clinical practice (independent combination of PIRADS score and CAD likelihood into a combination score) might not be feasible, as radiologists or urologists will always have the final say. Nevertheless, we chose to perform independent combination to assess the potential observer-independent effect of CAD. In future work, one could assess the optimal way for radiologists to incorporate CAD results in their reports.

PIRADS 1 and 2 lesions were generally not biopsied and therefore are only partially included in this study. This precludes assessment of the effect of the CAD system in those lesions. However, this has little impact on the results of this study. Of the seven biopsied PIRADS 2 lesions, none were categorized as cancer. The negative predictive value of PIRADS 1 and 2 scores is already so high that radiologists do not need computer aid for these PIRADS scores. The literature also confirms this assessment, with the studies by Thompson et al. [8] and Pokorny et al. [11] reporting MRI sensitivities and negative predictive values of 97 and 96.9 %, respectively. The CAD system has most potential in more accurately discriminating which PIRADS 3, 4 or 5 lesions require biopsy and the results at these scores are not affected by the lack of PIRADS 1 or 2 biopsies.

Last, the patient population in this study contained only patients for whom initial PSA tests and TRUS biopsies were inconclusive. As such, the results of this study cannot be directly translated to other patient groups (e.g., staging). However, due to the similar protocols for detection and staging MRI we expect results to be comparable. Furthermore, with prostate cancer guidelines in many countries now recommending MRI if PSA/TRUS results are inconclusive, we expect that the majority of prostate MRIs will be done for detection purposes.

Conclusions

In this paper the use of a CAD system in conjunction with the radiologist to accurately characterize prostate lesions was investigated. Results showed that a significant increase in diagnostic performance can be achieved when combining the radiologist PIRADS score and CAD system likelihood into a combination score. Furthermore, a significant correlation between CAD likelihood and cancer grade exists; this increases further when using the combination score.