Background

Prostate cancer is the most prevalent cancer and the second leading cause of cancer deaths among men in the USA [1]. A reliable prostate cancer screening approach that provides accurate risk assessment for targeting, diagnosis and treatment remains a critical need. The European Randomized Study of Screening for Prostate Cancer (ERSPC) reported that PSA-based screening reduced the rate of death from prostate cancer by 20% [2], but PSA screening is limited by low specificity, leading to overdiagnosis at an estimated rate of 23 to 42% [3].

Transrectal ultrasound (TRUS)-guided needle biopsy of the prostate is recommended for patients with elevated serum PSA levels, a prostate that feels abnormal on digital rectal examination, or both. Given the heterogeneous and multifocal nature of prostate cancer, both indolent and clinically significant tumors may be found in the same gland. It is also known that tumors located in certain regions of the prostate are undersampled, so dominant or high-grade tumors in these regions can be missed. In addition, prostate cancer stage upgrading or downgrading frequently occurs following repeat biopsies [4]. More recently, ultrasound-MRI fusion-guided needle biopsies have been shown to improve precision in identifying, targeting and sampling prostate lesions of interest [5, 6].

Multi-parametric magnetic resonance imaging (mpMRI) has shown great promise as a non-invasive approach for prostate cancer detection [7], but the lack of uniform interpretation and reporting has led to high variability among radiologists [8].

It is generally agreed, however, that radiological appearance and its interpretable descriptions are related to cancer progression [9, 10]. Radiologist training in the performance, interpretation and reporting of prostate imaging studies plays a major role in improving prostate cancer detection [11]. Various groups have developed radiology-based reporting scales for prostate cancer [12,13,14]. For example, a Likert reporting scale has been recommended by the Prostate Diagnostic Imaging Consensus Meeting (PREDICT) panel; it reduces the radiologist's opinion to a simplified 5-point scale [15].

The European Society of Urogenital Radiology (ESUR) first proposed the use of the Prostate Imaging Reporting and Data System (PI-RADS) as a way to standardize reporting of imaging consensus criteria. PI-RADS was later adopted by the American College of Radiology (ACR), and jointly proposed changes were formulated in a revision of the criteria [16, 17]. Findings on mpMRI are assessed on a 5-point categorical scale, based on the expert’s observational probability that a combination of findings on T2-weighted (T2WI) sequences, diffusion-weighted MRI (DWI) and dynamic contrast-enhanced MRI (DCE-MRI) correlate with the presence of a clinically significant prostate cancer at the specific location. The overall PI-RADS score considers a combination of multiple features obtained for the modality/sequences, such as nodule shape, margin and intensity. The PI-RADS assessment categories have a range of 1 to 5, with 5 being most likely to represent clinically significant prostate cancer. Previous studies [18, 19] have shown moderate inter-reader agreement with PI-RADS. The major pitfall in the clinical use of PI-RADS has been the degree of subjectivity of radiologists in study interpretations, leading to large variability in reported findings, and a suboptimal ability to characterize the nature and/or degree of malignancy in a lesion of interest [20].

Locating clinically significant cancers and discriminating them from insignificant ones remains a challenge in prostate cancer screening. Current validation is primarily based on the pathologic Gleason score. Tumors with a Gleason score ≥ 7 (3 + 4 or 4 + 3) are considered clinically significant, with aggressiveness increasing as the score rises to 8, 9 and 10 [20]. Recently, there have been numerous efforts to develop quantitative metrics for medical imaging to identify and describe abnormalities in radiological studies [21,22,23,24].

In this study, we propose to describe radiological traits independently for each mpMRI sequence on a numerical point scale. These traits were then taken in combination and related to the pathological outcome of cancer aggressiveness (Gleason score) using a linear classifier approach. The combinations of traits were rigorously evaluated in a cross-validation setting with multiple repeats. The semantic-based feature model was then compared with PI-RADS-based predictors at different cutoffs for identifying clinically significant cancer.

Methods

Patient cohort

The study was approved by the Institutional Review Board (IRB) at the University of South Florida, and patient informed consent was waived for the retrospective analysis. All patients were referred to the Radiology department for multiparametric MRI and targeted prostate lesion biopsy planning using the UroNav Ultrasound-MR Fusion Biopsy System (Invivo Corporation) at the H. Lee Moffitt Cancer Center. The patients were scanned using a Siemens 1.5 T MRI scanner with endorectal coil placement (Table 1). The inclusion criteria were as follows: a) availability of at least one targeted biopsy by the UroNav fusion system identified on the original interpretation, b) availability of mpMRI sequences (T2WI, DCE, DWI/ADC) suitable for PI-RADS (version 2) scoring, and c) no image-related limitations, e.g., post-biopsy hemorrhage or motion artifacts.

Table 1 Clinical characteristics of the study cohort

The exclusion criteria were prior localized treatment such as external beam radiation therapy, brachytherapy or cryoablation. The data curation step excluded 24 patients from the initial list, leaving 103 patients (167 biopsies) in the study cohort. There were 90 biopsies (65 unique patients) with Gleason scores ≥ 6; of these, 33 biopsies had a Gleason score equal to 6 and 57 biopsies had Gleason scores ≥ 7. The remaining 77 biopsies were negative for cancer (benign). Data extracted included age, race, smoking status, other cancer history, family cancer history and PSA level. A board-certified pathologist evaluated the cancer status and Gleason scores for the slides. Multi-parametric MRI scans (T2WI, DCE, DWI, ADC) were downloaded from the picture archiving and communication system (PACS). Semantics were scored using offline DICOM (Digital Imaging and Communications in Medicine) viewers with prostate-specific window settings.

Radiologist marked biopsy targets

The clinical radiologist marked the most aggressive target locations on the mpMRI scans, and the markings were finalized by consensus reading with a fellow radiologist on duty. The markings were made using a commercial prostate biopsy system (UroNav/DynaCAD, Invivo Inc., FL) that integrates its software modules with the biopsy hardware, including a real-time transrectal ultrasound (TRUS) localization system. Patient preparation and endorectal coil placement followed the standard procedure. Targeted core biopsies were obtained using automatic spring-loaded biopsy needles. Additionally, standard extended-pattern 12-core biopsies (sextants) were obtained in accordance with the NCCN guidelines. The core targets were labeled and processed separately.

Semantics and PI-RADS-version2 scoring

Semantic descriptors were derived from lesions targeted for UroNav fusion biopsy. The semantics were scored for each modality (T2WI, DCE, DWI, ADC) independently on a point scale (1 to 5). A total of 24 semantic features were developed, of which 16 were used in this study. Specifically, these features described the location, size, shape, margin, intensity and extra-prostatic extension of the lesion, the organ volume, and the presence of either benign prostatic hyperplasia or lymphadenopathy (detailed explanation in Table 2). Figure 1 shows example patient MRI scans with semantic scores. Fig. 1a shows scores for nodule shape: an oval nodule was scored as 1, an irregular nodule as 2, and an amorphous nodule as 3. Fig. 1b shows examples of semantic scores on ADC images: the nodule at upper left was hyperintense, upper right was iso-intense, lower left was hypointense, and lower right was markedly hypointense. Fig. 1c shows contrast-enhanced images: the first (left) panel shows no early enhancement (score = 1), followed by slight enhancement (score = 2), moderate enhancement (score = 3), and obvious enhancement (score = 4) of the nodule indicated by an arrow. The semantic features were systematically scored on a point scale (ranging from 2 to 5 levels) by the radiologists (Q.L. and H.L.), and the PI-RADS version 2 scores (referred to as PI-RADS in this article) were independently evaluated following the guidelines of the American College of Radiology (ACR). To assess the variability of the semantics among expert readers, we randomly selected 40 lesions (34 patients), and a third radiologist (J.C.) independently reviewed the scans and scored the semantics using the scoring sheet and point scale descriptors.

Table 2 Detailed semantic descriptors for prostate cancer a) broad categories b) feature description
Fig. 1

Example of semantic scoring for prostate cancer: (a) nodule shape (1 = round/oval, 2 = irregular, 3 = amorphous) and border (1 = well defined, 2 = everything else between 1 and 3, 3 = poorly defined); (b) ADC intensity (1 = hyperintensity, 2 = iso-intensity, 3 = hypointensity, 4 = marked hypointensity); (c) nodule enhancement (1 = no enhancement, 2 = slight enhancement, 3 = moderate enhancement, 4 = obvious enhancement)

Statistical analysis

Agreement between the radiologists (Q.L. and J.C.) was measured by the (weighted) kappa index [25] for binary or ordinal variables. The kappa value was interpreted as follows: < 0, less than chance agreement; 0.01 to 0.2, slight agreement; 0.21 to 0.4, fair agreement; 0.41 to 0.6, moderate agreement; 0.61 to 0.8, substantial agreement; > 0.8, almost perfect agreement [26]. In our analysis, the radiologists scored 16 semantic features. Of these, 4 features had kappa ≥ 0.7, 4 features had kappa ≥ 0.6 and < 0.7, 4 features had kappa ≥ 0.5 and < 0.6, and for 4 features kappa could not be computed because of the limited range of the observed semantic characteristics (see Table 3).
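The agreement measure described above can be illustrated with a minimal sketch of Cohen's weighted kappa for two readers. The function names (`weighted_kappa`, `interpret_kappa`), the choice of linear weights, and the boundary handling of the interpretation bands are assumptions made for illustration; the study does not state which weighting scheme or software was used.

```python
import math


def weighted_kappa(rater_a, rater_b, categories, weights="linear"):
    """Cohen's weighted kappa for two raters scoring the same lesions.

    `categories` lists the ordinal levels in order (at least two),
    e.g. [1, 2, 3, 4]. With two levels this equals the unweighted kappa.
    The linear weighting default is an assumption, not the paper's choice.
    """
    k = len(categories)
    n = len(rater_a)
    index = {c: i for i, c in enumerate(categories)}

    # counts[i][j] = number of lesions scored level i by A and level j by B.
    counts = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        counts[index[a]][index[b]] += 1
    row = [sum(counts[i]) for i in range(k)]
    col = [sum(counts[i][j] for i in range(k)) for j in range(k)]

    def penalty(i, j):
        # Disagreement weight: 0 on the diagonal, 1 for the extreme levels.
        d = abs(i - j) / (k - 1)
        return d if weights == "linear" else d * d

    cells = [(i, j) for i in range(k) for j in range(k)]
    observed = sum(penalty(i, j) * counts[i][j] for i, j in cells) / n
    expected = sum(penalty(i, j) * row[i] * col[j] for i, j in cells) / (n * n)
    if expected == 0:
        # Both raters used a single level: kappa is undefined.
        return math.nan
    return 1.0 - observed / expected


def interpret_kappa(kappa):
    """Map kappa to the agreement bands quoted in the text [26]."""
    if math.isnan(kappa):
        return "not computable"
    if kappa < 0:
        return "less than chance"
    for upper, label in [(0.2, "slight"), (0.4, "fair"),
                         (0.6, "moderate"), (0.8, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"
```

When both readers assign the same single level to every lesion, the expected disagreement is zero and kappa is undefined, which is one way a feature with a limited range of observed values can end up without a computable kappa.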

Table 3 Reproducibility of semantic features and PI-RADS scores between two radiologists on randomly selected prostate patients with 40 targeted biopsies (32 unique patients): a) actual scores b) sorted scores

We built a linear classifier model to find discriminant features that distinguish clinically significant cancers from indolent cases (GS ≥ 7 vs GS ≤ 6), and indolent cases from benign (GS = 6 vs benign). We selected the best 3 semantic features by evaluating all possible feature combinations, ranked by Youden’s index [27, 28], to obtain highly predictive discriminators. The statistics were estimated using a cross-validation approach (hold-out, 10-fold), randomly repeated over 100 times [29]. We also computed the area under the receiver operating characteristic curve (AUROC) along with sensitivity, specificity, positive predictive value, and negative predictive value for the multivariable combinations of interest. The reported statistics are the ensemble values obtained over the random repeats, reported with 95% confidence limits.
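The selection procedure described above (exhaustive 3-feature combinations, a linear classifier, repeated cross-validation, ranking by Youden's index) can be sketched as follows. This is an illustrative sketch only: the mean-difference linear rule is a simple stand-in, since the exact linear classifier is not specified here, and the function names, the 10-fold partition scheme, and the data layout (a dictionary of per-feature score lists) are all assumptions.

```python
import itertools
import random
from statistics import mean


def youden_j(labels, predictions):
    """Youden's index J = sensitivity + specificity - 1 for 0/1 labels."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity + specificity - 1.0


def fit_mean_difference(rows, labels):
    """Stand-in linear rule (assumption): w = mean(pos) - mean(neg),
    with the decision boundary through the midpoint of the class means.
    Requires at least one example of each class in the training data."""
    dims = range(len(rows[0]))
    pos = [r for r, y in zip(rows, labels) if y == 1]
    neg = [r for r, y in zip(rows, labels) if y == 0]
    mu_pos = [mean(r[d] for r in pos) for d in dims]
    mu_neg = [mean(r[d] for r in neg) for d in dims]
    w = [p - q for p, q in zip(mu_pos, mu_neg)]
    b = -sum(wi * (p + q) / 2 for wi, p, q in zip(w, mu_pos, mu_neg))
    return w, b


def predict(model, rows):
    w, b = model
    return [1 if sum(wi * xi for wi, xi in zip(w, r)) + b > 0 else 0
            for r in rows]


def rank_triplets(features, labels, n_repeats=100, n_folds=10, seed=0):
    """Rank every 3-feature combination by mean cross-validated Youden's J.

    `features` maps a (hypothetical) feature name to its list of scores.
    """
    rng = random.Random(seed)
    n = len(labels)
    results = {}
    for combo in itertools.combinations(sorted(features), 3):
        rows = [[features[name][i] for name in combo] for i in range(n)]
        j_values = []
        for _ in range(n_repeats):
            order = list(range(n))
            rng.shuffle(order)
            preds = [0] * n
            for f in range(n_folds):
                held_out = order[f::n_folds]
                held = set(held_out)
                train = [i for i in order if i not in held]
                model = fit_mean_difference([rows[i] for i in train],
                                            [labels[i] for i in train])
                for i, p in zip(held_out,
                                predict(model, [rows[i] for i in held_out])):
                    preds[i] = p
            j_values.append(youden_j(labels, preds))
        results[combo] = mean(j_values)
    return sorted(results.items(), key=lambda kv: kv[1], reverse=True)
```

With 16 candidate features, the exhaustive search covers 560 triplets, so the cost is dominated by the number of random repeats rather than by the number of combinations.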

Results

The final cohort used for the study had 167 biopsies (103 patients): 57 biopsies were considered clinically significant tumors (GS ≥ 7), 33 biopsies were indolent tumors (GS ≤ 6), and 77 biopsies were benign. Patient age ranged from 46 to 75 years at diagnosis. The PSA levels ranged from 0.8 to 44.7 ng/ml. Figure 2 shows the distribution of semantic values in box plots, with the PI-RADS score plotted against pathological Gleason scores.

Fig. 2

Box plots of semantic traits across Gleason grades in the study cohort. (a) PI-RADS and T2 semantic traits, (b) ADC semantic trait and enhancement edge, (c) enhanced homogeneity and extra-prostatic extension

Among the semantic features, the kappa scores for capsule status (presence or absence), homogeneity, shape, T2 intensity, and ADC intensity showed moderate agreement. Enhancement degree, extra-prostatic extension, enhanced homogeneity, and border were in substantial agreement between readers, while early enhancement and cyst (presence or absence) showed almost perfect agreement. The scores for four features (seminal vesicle involvement, distal sphincter involvement, bladder neck involvement and lymphadenopathy) could not be computed because of a lack of examples.

We found ADC intensity, homogeneity, and early enhancement to be the univariate semantic predictors that gave the highest average AUROC (0.57 to 0.68). The PPV (positive predictive value) and sensitivity for these markers were relatively high, with average values of 0.62 to 0.69 and 0.82 to 0.96, respectively, for finding clinically significant prostate cancers (GS ≥ 7).

When these features were combined, the combination of ADC intensity, T2 intensity, and enhancement homogeneity showed the highest average AUROC of 0.70, with average sensitivity and PPV for detecting aggressive cancer of 0.79 and 0.72, respectively. The next best feature combination was based on early ADC intensity, border, and enhancement homogeneity, which had an average AUROC of 0.71, with average sensitivity and PPV for detecting aggressive cancer of 0.82 and 0.68, respectively. For comparison, we characterized predictors that discriminate aggressive from indolent prostate cancers using overall PI-RADS (version 2) scores. We repeated the predictive analysis with different cutoffs on the PI-RADS scores to discriminate aggressive cancer (i.e., PI-RADS ≥ 5, PI-RADS ≥ 4, PI-RADS ≥ 3). We found that a moderate cutoff (PI-RADS ≥ 4) showed the highest AUROC of 0.6, with sensitivity and PPV of 0.98 and 0.69, respectively. Table 4 shows discriminant semantic features with their predictive statistics. We also found that the receiver operating characteristic of the top semantic predictors (ADC intensity, T2 intensity, enhancement homogeneity) was significantly different from those of the PI-RADS 3 (p = 0.0022) and PI-RADS 5 (p = 0.0048) based predictors of malignancy defined by Gleason score (GS ≥ 7), whereas the difference from the PI-RADS 4 based predictor was not significant (p = 0.0724). Significance was computed using the non-parametric DeLong statistic [30].
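The PI-RADS cutoff comparison above amounts to dichotomizing an ordinal score and reading sensitivity, specificity, PPV and NPV off the resulting confusion matrix. The sketch below illustrates this under stated assumptions: the function names and the toy scores are hypothetical, and the AUROC is computed with the standard pairwise (Mann-Whitney) estimate rather than any specific software used in the study.

```python
def auroc(scores, labels):
    """AUROC as the probability that a positive case outranks a negative
    case, counting ties as one half (Mann-Whitney estimate)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def cutoff_metrics(scores, labels, cutoff):
    """Dichotomize an ordinal score at `score >= cutoff` and summarize it."""
    preds = [1 if s >= cutoff else 0 for s in scores]
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        # For a binary predictor this equals (sensitivity + specificity) / 2.
        "auroc": auroc(preds, labels),
    }


# Hypothetical toy data: overall PI-RADS scores and GS >= 7 labels (1 = yes).
toy_scores = [5, 4, 4, 3, 2, 1]
toy_labels = [1, 1, 0, 1, 0, 0]
for cutoff in (3, 4, 5):
    print(cutoff, cutoff_metrics(toy_scores, toy_labels, cutoff))
```

Because a dichotomized score has a single operating point, its AUROC is the average of sensitivity and specificity, so a cutoff with very high sensitivity can still have a modest AUROC when its specificity is low.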

Table 4 Feature-based predictors that discriminate aggressive-grade prostate cancers (Gleason ≥ 7 vs ≤ 6): a) univariate semantic predictors b) multivariable semantic predictors (up to 3 semantics) c) PI-RADS-based predictor

We then built models to find semantic predictors that differentiate indolent (GS = 6) from benign cases. We found extra-prostatic extension, early enhancement, and ADC intensity to be univariate discriminators, with an average AUROC of 0.58 to 0.61. When these features were combined, the combination of ADC intensity, early enhancement and extra-prostatic extension showed the highest average AUROC of 0.63, with an average sensitivity and PPV of 0.16 and 0.51, respectively. The next feature combination of homogeneity, early enhancement degree, and extra-prostatic extension had an average AUROC of 0.63, with sensitivity and PPV of 0.20 and 0.57, respectively.

Receiver operating characteristics for the top predictors are shown in Fig. 3. Adding semantics to PI-RADS increased the average AUC from 0.63 to 0.64 for GS 6 vs benign and lowered it from 0.7 to 0.66 for GS 7 vs GS 6.

Fig. 3

Receiver operating characteristics of semantic and PI-RADS-based predictors that (a) identify clinically significant grade prostate cancer (≥ GS 7 from GS 6) and (b) distinguish Gleason 6 from benign

Discussion

In this study, we proposed a radiological semantic scheme that captures traits on a point scale independently on different modalities of mpMRI. We used the semantic descriptors in combination to build linear discriminant functions to identify clinically significant prostate cancers. These semantic predictors were then compared with PI-RADS-v2 based discriminators; the American College of Radiology has adopted the PI-RADS (version 2) system to report standardized prostate cancer findings in mpMRI [17, 31]. We found that semantics demonstrated better predictability of pathological outcome than PI-RADS-based predictors. We believe semantic traits may help reduce the variability in image interpretation between radiologists, as the observational scores are made for one trait at a time, independently in each modality (T2w, ADC, DCE). Semantic scoring is specifically defined to obtain an expert opinion about a radiological trait, such as the presence or absence of a trait, or the multi-level appearance of a trait in the scan.

A recent report compared the PI-RADS 1 and 2 scoring schemes and found a PPV of 75% for both versions in finding clinically significant cancers. The NPV (negative predictive value) was 46% for PI-RADS-1 and 43% for PI-RADS-2 in a cohort of 66 patients [32].

Tissue cell density has been well characterized and is reflected in molecular movement; prostate carcinoma is characterized by reduced ADC values [33]. Further, ADC values in the prostate have been shown to be inversely related to the Gleason score [34, 35]. ADC is useful in differentiating carcinoma from benign hyperplasia [36] and high-risk patients from those at low and intermediate risk [37], and is helpful for transitional zone (TZ) lesion detection [38]. We also found that ADC is a critical marker for identifying clinically significant cancer and is capable of distinguishing indolent from benign cases. Because of interpretational variability, dynamic contrast-enhanced images do not contribute to the overall clinical assessment of prostate lesions, especially in PI-RADS-v2 (with the exception of a PI-RADS score of 3). In our study, however, early enhancement and enhancement degree were effective predictors, and cancerous nodules usually presented early enhancement and a higher enhancement degree. When combined with ADC intensity and extra-prostatic extension, they formed better predictors of clinically significant cancers.

Clinically, any non-binary point scale can cause some unnecessary confusion among practitioners, eventually leading to variability in diagnosis that affects patient care [39]. In our study, we used discriminant functions and formed different multivariable models by agnostically combining traits across modalities, with each trait having an equal likelihood of being part of the predictor model. We limited the size of the predictors to three semantic traits because of the limited sample size. This approach allows information to be combined across modalities to find clinically significant prostate cancers.

We found that the PI-RADS-based predictor with a cutoff of ≥ 4 showed slightly lower discriminatory ability for finding clinically significant cancers (AUROC of 0.6, sensitivity and PPV of 0.98 and 0.68, respectively) than for differentiating indolent from benign cases (AUROC of 0.62, PPV of 0.38 and sensitivity of 0.77). Semantics-based predictors showed better performance, with AUROCs of 0.70 and 0.63 for discriminating clinically significant versus indolent tumors and indolent tumors versus benign, respectively (see Tables 4 and 5). We also found that adding semantics to PI-RADS (overall score) improved predictor performance, both in discriminating clinically significant lesions (GS ≥ 7) from indolent (GS = 6) and benign from indolent (GS = 6).

Table 5 Feature-based predictors that discriminate indolent-grade cancer from benign (Gleason = 6 vs benign): a) univariate semantic predictors b) multivariable semantic predictors (up to 3 semantics) c) PI-RADS-based predictor

There is a high level of subjectivity among radiologists in scoring PI-RADS (v.2) [40]. In a recent review, these shortcomings were categorized into clinical indications and technical/physiological artifacts [41, 42]. The clinical consequence for disease identification has been an impact on patient care, with over-detection in some cases and missed diagnosis of aggressive cancer in others. We believe evaluation of semantic traits in mpMRI images will reduce subjectivity in tumor detection.

In our study, trained radiologists were asked to describe observed traits on a point scale following the semantic descriptors, and these were then related to pathological outcome. The use of semantic discriminant functions may provide the oncologist with an alternative real-valued risk score for deciding on an appropriate management plan for the patient. We understand that such predictors need to be trained on a larger cohort to obtain balanced coefficients for the radiological traits.

We believe semantic predictors can discriminate clinically significant cancers and provide valuable risk assessment to aid clinical decisions both in targeting lesions and planning treatment for the disease.

Limitations

  • We assembled 103 patients (167 biopsies). All of the data were obtained at a single institution with a diverse cohort and were used to train the model in a cross-validation setting. The data at our center were obtained from a couple of clinical locations, and biopsies were carried out by multiple urologists. Data from multiple institutions would improve the diversity of the cohort.

  • This approach would have a better chance of yielding a stable model with independent test and validation cohorts. We acknowledge the absence of such a dataset.

  • We used the lesions on the mpMRI scan to make the semantic assessment, and pathological validation was obtained by TRUS/MRI biopsy. It is possible that the biopsy core missed the leading tumor in a lesion, resulting in a lower Gleason grade and consequently reduced classifier performance.

Conclusions

The proposed radiological semantic schema for describing prostate lesions on mpMRI shows promise in quantifying tumor imaging traits. A model-based approach to these traits provides a computational means of relating the findings to pathological outcome. These methods show potential for discriminating prostate cancer lesions with better accuracy than currently practiced risk assessment.