Our Institutional Review Board approved the study and waived the requirement for individual consent for retrospective analysis of prospectively acquired patient data collected as part of clinical trials/routine care (R&D No: 12/0195, 16 July 2012).
Two temporally separated cohorts were built: one for generating models (training cohort) and another for temporal validation (validation cohort).
For the training cohort, a trial dataset of 330 patients was interrogated. Full details of the trial have been previously reported. In brief, inclusion criteria were (i) men who underwent previous transrectal ultrasound biopsy whereby suspicion remained that cancer was either missed or misclassified and (ii) men suitable for further characterization using transperineal template prostate mapping (TPM) biopsy. Exclusion criteria were (i) previous history of prostate cancer treatment and (ii) lack of complete gland sampling or inadequate sampling density at TPM.
Selection criteria for building the training cohort were (i) 3-T mp-MRI, performed between February 2012 and January 2014; (ii) Likert ≥ 3/5 index lesion on mp-MRI, defined on the trial proforma following multidisciplinary tumor board discussion, whereby lesions were assigned to be of TZ or PZ origin; and (iii) TPM and targeted index lesion biopsy confirming tumor (defined as Gleason score 3 + 3 or greater). Gleason pattern 5 was not found in any samples. The index lesion was defined as the most conspicuous lesion with the highest Likert score (3, 4, or 5). For the PZ, this cohort consisted of 72 Gleason 4-containing lesions (38 Gleason 3 + 4, 34 Gleason 4 + 3) and 27 Gleason 3 + 3 lesions; for the TZ, it consisted of 22 Gleason 4-containing lesions (20 Gleason 3 + 4, 2 Gleason 4 + 3) and 27 Gleason 3 + 3 lesions. A flow diagram for patient selection is shown in Fig. 1.
The validation cohort consisted of 30 consecutive men: 20 PZ (6 Gleason 3 + 4, 4 Gleason 4 + 3, and 10 Gleason 3 + 3) and 10 TZ (3 Gleason 3 + 4, 1 Gleason 4 + 3, and 5 Gleason 3 + 3) with the same selection criteria and scanning protocol as in the training cohort, performed between June and December 2015.
Table 1 shows the age, the PSA, and the gland and tumor volume of the patients in the two cohorts.
Multiparametric MRI protocol
Mp-MRI was performed using a 3-T scanner (Achieva, Philips Healthcare) and a 32-channel phased-array coil. Prior to imaging, 0.2 mg/kg (up to 20 mg) of a spasmolytic agent (Buscopan; Boehringer Ingelheim) was administered intravenously to reduce bowel peristalsis. Mp-MRI was compliant with the European Society of Uroradiology guidelines. Full acquisition parameters are shown in Table 2.
Ultrasound-guided TPM ± targeted biopsy acted as the reference standard for the training cohort using cognitive MR-guided registration. A systematic biopsy of the whole gland was performed through a brachytherapy template-grid placed on the perineum using a 5-mm sampling frame. Focal index lesions underwent cognitive MRI-targeted biopsies at the time of TPM. A genitourinary pathologist with 12 years of experience analyzed biopsy cores blinded to the MRI results. There were no instances of non-targeted samples yielding higher Gleason grades than targeted specimens.
TPM and targeted biopsies were chosen as the reference standard because they are superior to transrectal ultrasound biopsy, are the sampling method of choice in the active surveillance population, and avoid the spectrum bias associated with a prostatectomy reference standard, which favors patients with aggressive disease.
Multiparametric MRI review
Mp-MRI images were qualitatively assessed on an Osirix workstation by three board-certified radiologists independently (readers SP, MA, and SP). Radiologists were fellowship-trained, with 10, 2, and 3 years of experience in the clinical reporting of mp-prostate MRI, each reporting more than 100 mp-MRIs per year and regularly attending weekly multidisciplinary tumor board meetings. Radiologists were informed of the PSA level and subjectively evaluated whether or not the index lesion contained a Gleason pattern 4 component (i.e., a binary classification), based on their personal evaluation of imaging characteristics, as developed from years of prostate MRI reporting and pathological feedback at multidisciplinary tumor board meetings.
Radiologists were aware that high signal on b = 2000 s/mm² DWI with corresponding low ADC value, low T2W signal, and avid early contrast enhancement compared with normal prostatic tissue suggest higher grade disease [10, 17].
Extraction of mp-MRI-derived quantitative parameters
MR datasets were analyzed with MIM Symphony Version 6.1 (MIM Software Inc), which performs semi-automatic rigid translational co-registration of the volumetric and axial T2W, ADC, and DCE images, after which the registration can be refined manually.
A fourth board-certified radiologist (EJ), with 3 years of experience in the quantitative analysis of mp-prostate MRI and blinded to the histopathology results and the opinions of the other radiologists, manually contoured a volume of interest for each index lesion and recorded the mean signal intensity (SI) of each volume on the axial T2W, ADC, and DCE images at all time points. Contouring was performed on T2WI and manually adjusted on the DCE images and ADC maps to account for distortion and registration errors. A typical contoured lesion is shown in Fig. 2. In order to standardize signal intensity between subjects, normalized T2 signal intensity metrics were calculated by dividing the signal intensity of the lesion by that of the bladder urine.
Early enhancement (EE) and maximum enhancement (ME) metrics were derived from the DCE-MRI signal enhancement time curves. EE was defined as the first strongly enhancing postcontrast SI divided by the precontrast SI, and ME as the difference between the peak enhancement SI and the baseline SI normalized to the baseline SI.
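The signal-intensity metrics defined above can be written out as a minimal Python sketch. The function names and the example curve are hypothetical, the precontrast (baseline) SI is assumed to be the first time point, and the index of the "first strongly enhancing" time point is passed in by the caller because the text does not state how that time point was identified.

```python
def normalized_t2(lesion_si, urine_si):
    """Lesion T2W signal intensity divided by bladder-urine signal intensity."""
    return lesion_si / urine_si


def early_enhancement(curve, first_enhancing_idx, baseline_idx=0):
    """EE: SI at the first strongly enhancing postcontrast point / precontrast SI.

    first_enhancing_idx is supplied by the caller (assumption: chosen by the reader).
    """
    return curve[first_enhancing_idx] / curve[baseline_idx]


def maximum_enhancement(curve, baseline_idx=0):
    """ME: (peak SI - baseline SI) / baseline SI."""
    baseline = curve[baseline_idx]
    return (max(curve) - baseline) / baseline


# Hypothetical mean lesion SI at successive DCE time points (index 0 = precontrast).
curve = [100.0, 102.0, 180.0, 220.0, 210.0]
ee = early_enhancement(curve, first_enhancing_idx=2)
me = maximum_enhancement(curve)
t2_norm = normalized_t2(lesion_si=300.0, urine_si=1200.0)
```

With these illustrative numbers, EE is 180/100, ME is (220 - 100)/100, and the normalized T2 SI is 300/1200.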
Tumor volume, gland volume, and PSA density (PSAd) were also included as features in model development; the two volumes were measured using tri-planar measurements and the prolate ellipsoid formula.
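As an illustration of the volume and PSAd features, the prolate ellipsoid formula takes the volume as π/6 times the product of the three orthogonal diameters, and PSA density is PSA divided by gland volume. The sketch below uses hypothetical function names and measurements, and assumes diameters in cm (so that volume is in mL) and PSA in ng/mL.

```python
import math


def prolate_ellipsoid_volume(length_cm, width_cm, height_cm):
    """Volume in mL (1 cm^3 = 1 mL) from three orthogonal diameters in cm."""
    return math.pi / 6.0 * length_cm * width_cm * height_cm


def psa_density(psa_ng_ml, gland_volume_ml):
    """PSA density: serum PSA divided by gland volume (ng/mL per mL)."""
    return psa_ng_ml / gland_volume_ml


# Hypothetical tri-planar gland measurements and PSA value.
gland_volume = prolate_ellipsoid_volume(4.0, 5.0, 3.0)  # about 31.4 mL
psad = psa_density(6.3, gland_volume)                   # about 0.20 ng/mL per mL
```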
Machine learning models
Five classification models were tested, namely logistic regression (LR), naïve Bayes (NB), support vector machine (SVM), random forest (RF), and feed-forward neural network (FFNN).
To validate each model, fivefold cross-validation was applied: the data were split into five folds, with four folds used for training and one for testing the classifiers. This was repeated for five trials, with each fold used once as the test set. At each trial, a receiver operating characteristic (ROC) curve was built for both the training and test sets and the corresponding AUC calculated. The AUCs for the five trials were averaged to produce a single estimate, and the process was repeated for 100 rounds using a different partitioning of the data for each repetition.
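The repeated cross-validation scheme can be sketched as follows. This is an illustrative outline in plain Python, not the MATLAB implementation used in the study. All names are hypothetical, only the test-set AUC is computed, folds are stratified by class (an assumption, made so that every test fold contains both classes), and the single-feature centroid classifier is a placeholder standing in for the five models.

```python
import random


def auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_folds(labels, k, rng):
    """Shuffle each class separately and deal its indices round-robin into k folds."""
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds


def repeated_cv_auc(X, y, fit, predict, k=5, rounds=100, seed=0):
    """Mean over rounds of the mean test AUC across k folds, re-partitioning each round."""
    rng = random.Random(seed)
    round_means = []
    for _ in range(rounds):
        fold_aucs = []
        for test_idx in stratified_folds(y, k, rng):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(y)) if i not in held_out]
            model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
            scores = [predict(model, X[i]) for i in test_idx]
            fold_aucs.append(auc([y[i] for i in test_idx], scores))
        round_means.append(sum(fold_aucs) / k)
    return sum(round_means) / rounds


def fit_centroids(X, y):
    """Placeholder classifier: store the per-class mean of the first feature."""
    def class_mean(cls):
        values = [x[0] for x, t in zip(X, y) if t == cls]
        return sum(values) / len(values)
    return class_mean(0), class_mean(1)


def predict_centroids(model, x):
    """Score is higher when the sample lies closer to the positive-class mean."""
    m0, m1 = model
    return abs(x[0] - m0) - abs(x[0] - m1)


# Hypothetical, perfectly separable single-feature data (10 negatives, 10 positives).
X = [[float(v)] for v in list(range(10)) + list(range(20, 30))]
y = [0] * 10 + [1] * 10
mean_auc = repeated_cv_auc(X, y, fit_centroids, predict_centroids, rounds=10)
```

Averaging within each round before averaging across rounds mirrors the description above, where the five per-fold AUCs are first reduced to a single estimate.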
Since the performance of machine learning classifiers decreases when the data used to train the model are imbalanced, as is the case for the PZ cohort in our study (72 Gleason 4-containing vs. 27 Gleason 3 + 3 lesions), a resampling technique called Synthetic Minority Over-sampling TEchnique (SMOTE) was applied to the PZ training cohort. Here, the minority class is over-sampled by introducing synthetic examples along the line segments joining each minority class sample to any/all of its k nearest minority class neighbors. After applying SMOTE to the PZ training cohort, 45 synthetic samples belonging to the class of Gleason 3 + 3 cancers were added, and these re-balanced data were used to generate the classifiers. SMOTE was not applied to the TZ training cohort, as this cohort was sufficiently balanced.
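The interpolation step of SMOTE can be illustrated with a short sketch. It is a simplified version under stated assumptions: the base sample for each synthetic point is drawn at random rather than by iterating over every minority sample, k = 5 is an assumed value (the text does not state k), Euclidean distance is used, and the feature vectors are randomly generated placeholders rather than study data.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points on segments joining minority samples to near neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of the base sample (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical 2-feature vectors for the 27 minority-class (Gleason 3 + 3) PZ lesions.
demo_rng = random.Random(1)
minority = [[demo_rng.gauss(0, 1), demo_rng.gauss(0, 1)] for _ in range(27)]
synthetic = smote(minority, n_synthetic=45)
balanced_minority = minority + synthetic  # 27 + 45 = 72, matching the majority class
```

Because every synthetic point lies between two existing minority samples, the over-sampled class stays inside the region of feature space already occupied by real Gleason 3 + 3 lesions.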
The Statistics and Machine Learning Toolbox of MATLAB (version R2017b 18.104.22.1683579, MathWorks) was used for all algorithms. The FFNN used one hidden layer of 20 neurons.
Model feature selection and internal validation
The best combination of features was derived from the training cohort dataset using the correlation feature selection (CFS) algorithm for TZ and PZ lesions, denoted as SELTZ and SELPZ respectively. CFS determines (i) how each feature correlates with the presence of Gleason 4 tumor, and (ii) whether any of the selected features are redundant due to correlations between them. Redundant features were removed from the SELTZ and SELPZ feature sets.
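The two criteria used by CFS are commonly combined into a single merit score for a subset of k features, k·r_cf / sqrt(k + k(k − 1)·r_ff), where r_cf is the mean feature-class correlation and r_ff the mean feature-feature correlation. The sketch below illustrates this idea only. The use of absolute Pearson correlation, the greedy forward search, the feature names, and the toy values are all assumptions, and the study's actual CFS settings may differ.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def cfs_merit(features, labels, subset):
    """Merit = k * r_cf / sqrt(k + k * (k - 1) * r_ff) for the named feature subset."""
    k = len(subset)
    r_cf = sum(abs(pearson(features[f], labels)) for f in subset) / k
    if k == 1:
        return r_cf
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = sum(abs(pearson(features[a], features[b])) for a, b in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


def cfs_forward_select(features, labels, min_gain=1e-12):
    """Greedy forward search: add the best feature while the merit really improves."""
    selected, best = [], 0.0
    remaining = list(features)
    while remaining:
        merit, name = max(
            (cfs_merit(features, labels, selected + [f]), f) for f in remaining
        )
        if merit <= best + min_gain:
            break
        selected.append(name)
        remaining.remove(name)
        best = merit
    return selected, best


# Hypothetical toy data: 1 = Gleason 4 present, 0 = Gleason 3 + 3.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "adc": [1.0, 1.2, 0.9, 1.1, 0.5, 0.6, 0.4, 0.7],     # informative
    "adc_x2": [2.0, 2.4, 1.8, 2.2, 1.0, 1.2, 0.8, 1.4],  # redundant copy of "adc"
    "noise": [1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0, 2.0],   # uncorrelated with labels
}
selected, best_merit = cfs_forward_select(features, labels)
```

In this toy example the search keeps a single ADC-like feature: its rescaled copy adds no merit because it is perfectly correlated with the feature already chosen, and the noise feature lowers the merit.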
As Fig. 3 shows, to test whether CFS was effective, we compared the performance of classifiers trained using all features (denoted ALL) with the performance of the classifiers trained using only SELTZ and SELPZ.
Best model selection and temporal validation
Using SELTZ and SELPZ, we applied a fivefold cross-validation to compare the classifiers and to select the best performing zone-specific models, defined by the highest AUC. A flow diagram of the comparisons is shown in Fig. 4.
Once the best performing models were selected for each zone, their performance was compared with that of the three radiologists. Mean values of sensitivity and specificity were compared with those obtained by the classifiers at three cut-off points of interest on the ROC curves. In particular, we considered:
(i) The point characterized by a specificity of 50% (point_50), which is of interest from a clinical standpoint, as we can tolerate classifying 50% of patients without a Gleason 4 component as false-positives provided a high level of sensitivity (i.e., low numbers of false-negatives) is maintained.
(ii) The point characterized by a specificity equal to the mean specificity of the three radiologists who assessed the images (point_RAD). We used this point to compare our models to the performance of an experienced radiologist.
(iii) The point closest to the point with sensitivity and specificity equal to 1 (point_01). We chose this point as it is characterized by the best trade-off between specificity and sensitivity.
For all three points, we derived the corresponding thresholds on the ROC curve obtained on the training set and then applied these thresholds to compute the sensitivity/specificity of the classifiers on the test set.
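The selection of the three operating points can be sketched in Python as follows. The function names, the toy scores, and the value used for the mean radiologist specificity (0.8) are hypothetical. A lesion is called positive when its score is at or above the threshold, and "closest" is taken to mean smallest Euclidean distance on the ROC plane, both of which are assumptions.

```python
import math


def sens_spec(labels, scores, threshold):
    """Sensitivity and specificity when scores >= threshold are called positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    n_pos = sum(labels)
    return tp / n_pos, tn / (len(labels) - n_pos)


def roc_points(labels, scores):
    """(threshold, sensitivity, specificity) for every distinct score used as a cut-off."""
    return [(t, *sens_spec(labels, scores, t)) for t in sorted(set(scores))]


def threshold_at_specificity(points, target_spec):
    """Threshold whose specificity is closest to the target (ties: higher sensitivity)."""
    return min(points, key=lambda p: (abs(p[2] - target_spec), -p[1]))[0]


def threshold_closest_to_perfect(points):
    """Threshold minimising the distance to sensitivity = specificity = 1."""
    return min(points, key=lambda p: math.hypot(1 - p[1], 1 - p[2]))[0]


# Hypothetical training-set labels (1 = Gleason 4 present) and classifier scores.
train_labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
train_scores = [0.1, 0.2, 0.3, 0.35, 0.45, 0.6, 0.4, 0.7, 0.8, 0.9]
points = roc_points(train_labels, train_scores)

t_50 = threshold_at_specificity(points, 0.5)   # point_50
t_rad = threshold_at_specificity(points, 0.8)  # point_RAD (0.8 is a made-up value)
t_01 = threshold_closest_to_perfect(points)    # point_01

# Thresholds fixed on the training set are then applied unchanged to the test set.
test_labels = [0, 0, 1, 1]
test_scores = [0.2, 0.65, 0.5, 0.9]
test_sens, test_spec = sens_spec(test_labels, test_scores, t_50)
```

Fixing the thresholds on the training ROC curve and only then evaluating on held-out data keeps the reported test sensitivity and specificity free of threshold tuning on the test set.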
Finally, we applied a temporally separated validation whereby the best performing classifier was trained on the training cohort and tested on the validation cohort. For all analyses performed in the PZ cohort, SMOTE was applied to the training set before it was used to train the classifier.