Supervised machine learning enables non-invasive lesion characterization in primary prostate cancer with [68Ga]Ga-PSMA-11 PET/MRI

Purpose Risk classification of primary prostate cancer in clinical routine is mainly based on prostate-specific antigen (PSA) levels, Gleason scores from biopsy samples, and tumor-nodes-metastasis (TNM) staging. This study aimed to investigate the diagnostic performance of positron emission tomography/magnetic resonance imaging (PET/MRI) in vivo models for predicting low-vs-high lesion risk (LH) as well as biochemical recurrence (BCR) and overall patient risk (OPR) with machine learning. Methods Fifty-two patients who underwent multi-parametric dual-tracer [18F]FMC and [68Ga]Ga-PSMA-11 PET/MRI as well as radical prostatectomy between 2014 and 2015 were included as part of a single-center pilot to a randomized prospective trial (NCT02659527). Radiomics in combination with ensemble machine learning was applied including the [68Ga]Ga-PSMA-11 PET, the apparent diffusion coefficient, and the transverse relaxation time-weighted MRI scans of each patient to establish a low-vs-high risk lesion prediction model (MLH). Furthermore, MBCR and MOPR predictive model schemes were built by combining MLH, PSA, and clinical stage values of patients. Performance evaluation of the established models was performed with 1000-fold Monte Carlo (MC) cross-validation. Results were additionally compared to conventional [68Ga]Ga-PSMA-11 standardized uptake value (SUV) analyses. Results The area under the receiver operator characteristic curve (AUC) of the MLH model (0.86) was higher than the AUC of the [68Ga]Ga-PSMA-11 SUVmax analysis (0.80). MC cross-validation revealed 89% and 91% accuracies with 0.90 and 0.94 AUCs for the MBCR and MOPR models respectively, while standard routine analysis based on PSA, biopsy Gleason score, and TNM staging resulted in 69% and 70% accuracies to predict BCR and OPR respectively. 
Conclusion Our results demonstrate the potential to enhance risk classification in primary prostate cancer patients built on PET/MRI radiomics and machine learning without biopsy sampling. Supplementary Information The online version contains supplementary material available at 10.1007/s00259-020-05140-y.


Introduction
Prostate cancer is the second most common cancer in men worldwide, with 1.3 million new cases diagnosed in 2018 [1,2]. Worldwide incidence rates increased significantly during the last decade, most likely due to the wider application of prostate-specific antigen (PSA) screening [2]. While the 10-year survival rate of prostate cancer is approximately 90%, advanced or late-stage prostate cancer may be life-threatening, particularly in metastasized stages of the disease [3].
The 5-year risk stratification in patients with primary prostate cancer is mainly built on clinical stage, PSA, and Gleason scores, derived from invasive biopsy samples [4]. Despite having profound effects on treatment planning and, thus, patient's quality of life, this approach has a number of limitations [3,5]. First, Gleason scoring relies on biopsy sampling, hence, can neither help assess the entire prostate nor fully characterize the heterogeneity of any pertinent tumor [6]. In addition, transrectal biopsy sampling has been associated with side-effects, such as haematospermia or haematuria [3]. Second, previously published risk classification systems were reported to have the tendency of incorrectly grading primary prostate cancer [3]. In patients with a high risk score and absent metastatic disease, radical prostatectomy is the treatment-of-choice [7] despite the risk of potential overtreatment [8] and at the same time, a 20-40% chance of biochemical recurrence (BCR) [9,10].
Combined positron emission tomography/computed tomography (PET/CT) or PET/magnetic resonance imaging (PET/MRI) using radiotracers targeting prostate-specific membrane antigen (PSMA) can help to localize suspicious lesions in the prostate [11,12]. PSMA-PET in combination with CT has been reported to improve primary tumor localization [13] and the diagnosis of recurrent prostate cancer [14,15] in patients after radical prostatectomy even at low PSA levels [16]. In contrast, PSMA-PET/MRI was shown to support the diagnosis of intermediate and high-risk patients as well as to detect tumor recurrence [13]. Nevertheless, the diagnosis of primary prostate cancer is still based on core-needle biopsy, with non-invasive imaging playing a role in the visual identification of lesions and/or in image-guidance for biopsy sampling [17,18].
Recently, radiomics has been argued to add value to diagnostic pathways and patient management [19]. Various studies have investigated the correlation of PSMA expression and clinical end-points in prostate cancer patients [14,15]. Furthermore, radiomics combined with machine learning in MRI [20,21] as well as in PET/CT [22-24] demonstrated the potential feasibility of establishing novel in vivo prediction models for prostate cancer risk assessment.
In light of the potential of combining PET/MR imaging, radiomics and machine learning (ML), the objectives of this study were as follows: (a) to establish and cross-validate prostate lesion low-vs-high risk in vivo ML predictive models built on PET/MRI radiomics, (b) to establish and validate biochemical recurrence and overall patient risk (OPR) models that utilize in vivo ML scores instead of biopsy grades together with PSA and clinical stage, and (c) to compare the above patient risk models to the standard risk stratification.

Patient data
Patients were selected from the database (n = 122) of a monocentric pilot study to a prospective randomized trial (clinicaltrials.gov NCT02659527) conducted between 2014 and 2015. Fifty-two of the 122 patients underwent surgery; in these patients, PET/MRI, PSA values, pre-operative biopsy results, and post-operative whole-mount histopathology were documented [15] (Table 1). All 52 patients underwent a dual-tracer, fully integrated PET/MRI scan ([18F]FMC and [68Ga]Ga-PSMA-11 sequentially). This study, however, only included the [68Ga]Ga-PSMA-11 PET image as well as the transverse relaxation time-weighted (T2w) and apparent diffusion coefficient (ADC) MRI sequences in the analysis (Supplement: Table 1). All patients were treated with radical prostatectomy according to guideline recommendations [3]. All surgical specimens were processed according to the institution's standard pathologic procedures in whole-mount sections. Staging and grading were performed according to the UICC TNM classification and WHO/ISUP 2005 system, respectively [25]. The study was approved by the local institutional ethical committee and patients provided their written informed consent. See Fig. 1 for the CONSORT study diagram.

Delineation
Delineation and annotation of prostate lesions on PET/MR images were performed using the Hybrid 3D software ver. 4.0.0 (Hermes Medical Solutions, Stockholm, Sweden). Here, [68Ga]Ga-PSMA-11 PET and T2w as well as ADC MR images were viewed side-by-side with the annotated, whole-mount histopathological slices. Delineation was done on the [68Ga]Ga-PSMA-11 image using standard three-dimensional iso-count volumes of interest (VOIs) (Fig. 2). The initial lesion delineations were cross-examined and, if required, corrected manually as part of an independent review process performed by PET and MRI specialists. This step resulted in 121 lesions in total. An additional reference region was defined in the gluteus muscle to normalize the standardized uptake value (SUV) of [68Ga]Ga-PSMA-11 and the T2w arbitrary voxel values to the mean of their respective reference background [26].

Feature extraction
Each image was resampled to a uniform 2.0 × 2.0 × 2.0 voxel resolution via ordinary Kriging interpolation [27,28]. Radiomic features with "very strong" or "strong" consensus values according to the Imaging Biomarker Standardization Initiative (IBSI) guidelines were extracted from the 121 resampled [68Ga]Ga-PSMA-11, T2w, and ADC lesions by the MUW Radiomics Engine (ver. 2.0), which was validated against IBSI standards [29] (Supplement Table 1). Conventional standardized uptake values, including SUVmax, SUVpeak, SUVmean, and SUVTLG, were merged with the 442 extracted radiomic features to compose a 446-element feature vector for each lesion. Although total lesion glycolysis (TLG) was originally proposed for [18F]FDG, it was included in our analysis as it characterizes [68Ga]Ga-PSMA-11 accumulation in prostate lesions.

Feature redundancy reduction
Feature redundancy ranking and reduction were performed across the 446 features by covariance matrix analysis [19], where two features were considered redundant if their absolute Pearson correlation coefficient exceeded 0.75. This step retained 80 features for further analysis.
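The redundancy filter described above can be illustrated with a minimal sketch. It is not the implementation used in the study: the greedy keep-first order, the handling of constant features, and the names `pearson` and `drop_redundant` are assumptions made for illustration. Only the 0.75 absolute-correlation threshold is taken from the text.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant feature: treated as uncorrelated (assumption)
    return sxy / math.sqrt(sxx * syy)

def drop_redundant(features, threshold=0.75):
    """Greedily keep features whose |r| with every kept feature is <= threshold.

    `features` maps a feature name to its values across all lesions.
    """
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical toy values, not study data.
toy = {
    "suv_max": [1.0, 2.0, 3.0, 4.0, 5.0],
    "suv_peak": [1.1, 2.1, 2.9, 4.2, 5.0],  # nearly collinear with suv_max
    "adc_iqr": [3.0, 1.0, 4.0, 1.0, 5.0],
}
print(drop_redundant(toy))
```

In this toy case the second feature is dropped because it is almost perfectly correlated with the first, while the third is kept.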

Reference standard
The whole-mount histopathology pattern of each delineated lesion was dichotomized as low risk (≤ Gleason 3, prostatic intraepithelial neoplasia (PIN), prostatitis, benign prostatic hyperplasia (BPH)) or high risk (≥ Gleason 4). Furthermore, BCR and OPR reference values were established for each patient. BCR was defined as two consecutive PSA values rising above 0.2 ng/ml. Follow-up was generally every 3 months for the first 2 years, then semiannually until the fifth year, then annually. Mean follow-up was 41 months. OPR was defined as high if BCR was positive, or if the node stage (clinical or pathological) or the metastasis stage (clinical or pathological) was positive.
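The patient-level reference labels can be expressed as two small rules. The sketch below is one reading of the definitions above, assuming that BCR requires two adjacent follow-up PSA measurements that both exceed 0.2 ng/ml; the function names and the example PSA series are hypothetical.

```python
def has_bcr(psa_series, threshold=0.2):
    """True if two consecutive follow-up PSA values (ng/ml) exceed the threshold."""
    return any(a > threshold and b > threshold
               for a, b in zip(psa_series, psa_series[1:]))

def opr_is_high(bcr_positive, node_positive, metastasis_positive):
    """Overall patient risk is high if any of the three findings is positive."""
    return bool(bcr_positive or node_positive or metastasis_positive)

# Hypothetical follow-up series, not study data.
print(has_bcr([0.05, 0.30, 0.10, 0.25]))  # isolated rises only
print(has_bcr([0.10, 0.25, 0.40]))        # two consecutive rises
print(opr_is_high(False, True, False))
```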

Statistical analysis in [68Ga]Ga-PSMA-11
The area under the receiver operator characteristic curve (AUC) was calculated for conventional SUVs and the volume of each delineated lesion in the [68Ga]Ga-PSMA-11 image to estimate the performance of predicting low-vs-high lesion risk. This process included SUVmax, SUVpeak, SUVTLG, and lesion volume values.
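For a single scalar such as SUVmax, the AUC can be computed directly from pairwise comparisons between high-risk and low-risk lesions. The sketch below uses this rank-based (Mann-Whitney) formulation; it is an illustration with hypothetical values, and the function name `auc` is an assumption rather than part of the study software.

```python
def auc(scores, labels):
    """AUC as the probability that a randomly chosen high-risk lesion (label 1)
    has a higher score than a randomly chosen low-risk lesion (label 0).
    Ties count as one half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical normalized SUVmax values and risk labels, not study data.
suv_max = [0.1, 0.4, 0.35, 0.8]
risk = [0, 0, 1, 1]
print(auc(suv_max, risk))
```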

Cross-validation scheme
A Monte Carlo (MC) cross-validation scheme was utilized to randomly assign training and validation roles to the 52 patients 1000 times. In each fold, five patients were assigned the validation role, while the remaining patients were assigned the training role. Splitting at the patient level was necessary to avoid mixing lesions from the same patient between training and validation. No repetitions were allowed during the generation of MC folds; thus, each of the 1000 fold configurations, with its training-validation selection, was unique.
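A patient-level MC split of this kind can be sketched as follows. The fold count, validation size, and uniqueness constraint follow the text; the random seed, the rejection of duplicate validation sets, and the function name `monte_carlo_folds` are assumptions made for illustration.

```python
import random

def monte_carlo_folds(patient_ids, n_folds=1000, n_validation=5, seed=0):
    """Generate unique patient-level training/validation splits.

    Splitting by patient keeps all lesions of one patient on the same side.
    """
    rng = random.Random(seed)
    seen = set()
    folds = []
    while len(folds) < n_folds:
        validation = frozenset(rng.sample(patient_ids, n_validation))
        if validation in seen:
            continue  # reject repeated configurations
        seen.add(validation)
        training = [p for p in patient_ids if p not in validation]
        folds.append((training, sorted(validation)))
    return folds

folds = monte_carlo_folds(list(range(52)))
print(len(folds), len(folds[0][0]), len(folds[0][1]))
```

Lesion-level training and validation sets then follow by collecting the lesions of the patients on each side of a fold.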

Machine learning scheme
A mixed ensemble learning scheme built on random forest (RF) classifiers was utilized to build models for predicting lesion LH as well as patient BCR and OPR (models denoted as MLH, MBCR, and MOPR, respectively) [26,30,31]. Nine RFs were used per ensemble (Table 2). The final prediction was provided by majority vote of the respective nine RFs. This approach was chosen to minimize hyperparameter bias and to increase predictive performance [32]. Furthermore, the average predictive score of the nine RFs represented a continuous value between 0.0 and 1.0 reflecting the prediction certainty of the mixed ensemble. This value could therefore be subjected to AUC analysis across MC folds.
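The way nine member outputs are combined into a class label and a continuous score can be sketched independently of how the individual forests are trained. In the sketch below, the member scores are hypothetical probabilities for the high-risk class, and the 0.5 voting threshold and the function name `ensemble_predict` are assumptions.

```python
def ensemble_predict(member_scores, threshold=0.5):
    """Combine per-classifier high-risk probabilities for one sample.

    Returns the majority-vote label and the mean score in [0.0, 1.0].
    """
    votes = sum(1 for s in member_scores if s >= threshold)
    label = 1 if votes > len(member_scores) / 2 else 0
    mean_score = sum(member_scores) / len(member_scores)
    return label, mean_score

# Hypothetical outputs of nine classifiers for a single lesion.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1]
label, mean_score = ensemble_predict(scores)
print(label, round(mean_score, 3))
```

With an odd number of members the vote cannot tie, and the mean score supplies the continuous value needed for AUC analysis.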

Lesion low-vs-high risk prediction
Training and validation lesion sets were generated according to the pre-generated MC scheme roles to train and validate the MLH models in each MC fold. To keep model complexity minimal and to reduce the chance of overfitting, the five top-ranking features were selected by R-squared ranking in the training dataset prior to establishing the MLH lesion model per fold [33]. The same five features were then selected from the respective validation dataset for evaluation. Validation model performance was estimated via confusion matrix analytics across the predictions of the validation cases of the MC folds [26]. The MLH scheme also underwent AUC analysis by evaluating the predictive performance of its averaged nine-RF vote across the MC validation cases. Last, to estimate the effect of sham data in the MLH model, confusion matrix analytics were also performed over randomly permuted labels across all MC folds [24,34].
Figure caption (beginning missing): ... standardized uptake value (SUV) and volume area under the receiver operator characteristic curve (AUC) analysis. A Monte Carlo (MC) cross-validation scheme was utilized to generate patient training and validation sets 1000 times. This MC scheme was utilized to build lesion low-vs-high (LH) prediction models via machine learning (MLH). Biochemical recurrence (BCR, n = 36) and overall patient risk (OPR, n = 50) patient prediction models were built across the same MC folds (MBCR and MOPR, respectively). All machine learning models underwent confusion matrix analytics, sham data analysis, and AUC analysis across MC folds. BCR and OPR were also predicted by the standard D'Amico score
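The per-fold feature selection and the label permutation used for the sham analysis can be sketched as below. This is an illustration under the assumption that "R-squared ranking" means the squared Pearson correlation between a feature and the binary risk label; the function names and toy data are hypothetical.

```python
import random

def r_squared(x, y):
    """Squared Pearson correlation between a feature and the labels."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant input: no explained variance (assumption)
    return sxy * sxy / (sxx * syy)

def top_features(train_features, train_labels, n_top=5):
    """Rank features on the training set only and return the best n_top names."""
    ranked = sorted(train_features,
                    key=lambda name: r_squared(train_features[name], train_labels),
                    reverse=True)
    return ranked[:n_top]

def permuted_labels(labels, seed=0):
    """Shuffled copy of the labels for a sham (chance-level) run."""
    shuffled = list(labels)
    random.Random(seed).shuffle(shuffled)
    return shuffled

# Hypothetical toy training data, not study data.
train = {"a": [0, 0, 1, 1], "b": [1, 0, 1, 0], "c": [0, 1, 1, 1]}
y = [0, 0, 1, 1]
print(top_features(train, y, n_top=2))
print(permuted_labels(y))
```

Ranking on the training set alone and then reusing the selected names on the validation set keeps validation data out of the feature selection.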

Feature weighting
The importance of each feature in predicting lesion low-vs-high risk was determined by counting how often each feature was selected by the R-squared ranking approach across the MC folds.
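Counting selections across folds amounts to a frequency table. The sketch below assumes each fold contributes a list of selected feature names; the names shown and the function `selection_counts` are hypothetical.

```python
from collections import Counter

def selection_counts(selected_per_fold):
    """Count in how many folds each feature was among the selected ones."""
    counts = Counter()
    for names in selected_per_fold:
        counts.update(set(names))  # a feature counts once per fold
    return counts

# Hypothetical selections from two folds, not study data.
example_folds = [
    ["cov", "glcm_ic1", "suv_max"],
    ["cov", "glcm_ic1", "adc_iqr"],
]
weights = selection_counts(example_folds)
print(weights.most_common())
```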

Patient biochemical recurrence and overall risk prediction
Patient risk models for predicting BCR and OPR (MBCR and MOPR, respectively) were established by analyzing the PSA, the enumerated clinical stage (Supplemental Table 3), and a composite MLH score (CLH) per patient calculated by Eq. 1:
$$C_{LH} = \frac{1}{V} \sum_{i=1}^{k} M_{LH}(i)\, v_i \qquad (1)$$
where $k$ is the number of lesions in the given patient, $M_{LH}(i)$ is the predicted low-vs-high risk score of lesion $i$ provided by the MLH model of the given fold, $v_i$ is the volume of lesion $i$, and $V = \sum_{i=1}^{k} v_i$ is the sum of lesion volumes in the given patient.
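As a worked illustration, the composite score can be computed as a volume-weighted mean of the per-lesion scores. This assumes that Eq. 1 is the weighted mean implied by the definitions of k, the lesion scores, the lesion volumes, and their sum V; the function name and the example values are hypothetical.

```python
def composite_lh_score(lesion_scores, lesion_volumes):
    """Volume-weighted mean of per-lesion low-vs-high risk scores."""
    if len(lesion_scores) != len(lesion_volumes):
        raise ValueError("one volume per lesion score is required")
    total_volume = sum(lesion_volumes)
    if total_volume <= 0:
        raise ValueError("total lesion volume must be positive")
    weighted = sum(s * v for s, v in zip(lesion_scores, lesion_volumes))
    return weighted / total_volume

# Hypothetical patient with two lesions: scores 0.9 and 0.2, volumes 3.0 and 1.0.
clh = composite_lh_score([0.9, 0.2], [3.0, 1.0])
print(round(clh, 3))
```

Here the larger lesion dominates: (0.9 × 3.0 + 0.2 × 1.0) / 4.0 = 0.725.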
Training and validation patient sets containing the above value triplets were generated as of the pre-generated MC scheme roles to train and validate the M BCR and M OPR models in each MC fold. In case a patient with validation role in the given fold had no BCR or OPR reference value available, it was excluded from the respective cross-validation of the given patient model.
To handle class imbalance, the training set was corrected by the synthetic minority oversampling technique (SMOTE) [24,35], independently for the MBCR and the MOPR training. Confusion matrix analytics were calculated across the validation sets of all MC folds of the MBCR and MOPR model schemes. The same process was repeated with permuted reference labels across the MC folds to estimate the effect of sham data.
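The core idea of SMOTE is to create synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. The following is a simplified sketch of that idea, not the implementation used in the study; the neighbour count `k`, the seed, and the function name `smote` are assumptions, and the input points are hypothetical.

```python
import random

def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic samples from a list of minority feature vectors."""
    if len(minority) < 2:
        raise ValueError("at least two minority samples are required")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base sample (squared Euclidean)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic

# Hypothetical (PSA-like, score-like) minority samples, not study data.
minority = [[1.0, 1.0], [2.0, 2.0], [3.0, 1.0]]
new_samples = smote(minority, n_new=4)
print(len(new_samples))
```

Oversampling is applied to the training set of a fold only, so that validation cases remain real patients.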

Lesion low-vs-high risk prediction
The MLH model validation performance as per the MC cross-validation scheme yielded 71% sensitivity, 90% specificity, 88% positive predictive value, 75% negative predictive value, 81% accuracy, and 0.86 AUC. Sham data analysis revealed 0.52 AUC for permuted labels in the MLH model.
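The reported metrics all derive from the four cells of a confusion matrix. The sketch below shows the standard definitions; the counts in the example are hypothetical and are not the pooled counts of the study, and the function assumes that every denominator is non-zero.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix metrics; high risk is the positive class."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

# Hypothetical counts for illustration only.
metrics = confusion_metrics(tp=71, fp=10, tn=90, fn=29)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```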

Feature weighting and distribution
Overall, seven distinct features were selected across the 1000 MC folds via the R-squared ranking method. The features selected in every fold (n = 1000) were the coefficient of variation and the gray level co-occurrence matrix (GLCM) information correlation type 1, both from the [68Ga]Ga-PSMA-11 image.

Patient biochemical recurrence and overall risk prediction
The cross-validation performance revealed an average validation accuracy of 89% and 91% as well as AUC of 0.90 and 0.94 for the MBCR and MOPR patient models, respectively. The MOPR model outperformed the MBCR model with 94% specificity, 93% positive predictive value, and 87% sensitivity. With sham data, MOPR and MBCR yielded 0.54 and 0.56 AUC, respectively. See Fig. 5 for the detailed performance values of the MBCR and MOPR models.

Discussion
In this study, we investigated the feasibility of predicting prostate lesion-specific low-vs-high risk built on PET/MRI radiomics and patient-specific biochemical recurrence as well as overall patient risk. We demonstrated excellent cross-validation performances for MLH (AUC 0.86) as well as for MBCR (AUC 0.90) and MOPR (AUC 0.94). Based on the above approaches and our achieved model performances, we consider that our findings have important clinical implications in the field of primary prostate cancer risk assessment as they point towards the feasibility of estimating lesion and patient risks in vivo. In addition to establishing the above models with radiomics and machine learning, conventional [68Ga]Ga-PSMA-11 SUV and volume analyses were also conducted. This analysis revealed that SUVmax had the highest predictive power (AUC 0.80) to classify low-vs-high prostate lesions, followed by SUVpeak and SUVTLG, while lesion volume had no significant predictive power (AUC 0.53). These findings are in line with previous analyses performed in PET/CT [24].
Feature ranking across our Monte Carlo folds demonstrated that [68Ga]Ga-PSMA-11 is the most important in vivo feature source to establish lesion risk prediction models compared to ADC and T2w MRI features. The highest-ranking [68Ga]Ga-PSMA-11 features were either simple statistical values such as the coefficient of variation and SUVmax or simple second-order textural ones such as information correlation from the GLCM feature category. Information correlation is a first-order GLCM feature reflecting the information content (a.k.a. entropy) of voxel neighborhood connectivity occurrences; thus, it is a basic heterogeneity descriptor. This feature was previously also identified as highly robust across various PET imaging centers [36]. The feature ranking across MC folds identified SUVpeak, SUVTLG, and volume as low-ranking; however, SUVmax was among the highest-ranking ones. While the potential of PSMA SUVmax in characterizing prostate cancer has been presented [37,38], Cysouw et al. concluded in a recent study that prostate risk in PSMA can be better characterized by textural parameters compared to SUVmax [24]. They utilized [18F]-DCFPyL PET/CT and reported 0.81 AUC to differentiate high-risk (GS ≥ 8) and low-risk prostate cases. Our findings, on the other hand, demonstrate that conventional SUV parameters in combination with simple textural features can yield high-performing models in [68Ga]Ga-PSMA-11 PET/MRI to characterize prostate risk.
While no T2w feature was selected as high-ranking, ADC interquartile range (also referred to as "robust" value range) was selected as high-ranking. Prior studies focusing on ADC analysis to predict prostate lesion risk consistently identified ADCmin, ADCmean, as well as ADCmedian [20,39] as highly predictive (AUC range 0.72-0.90). We consider that the above findings and ours describe the same phenomenon, namely, the strong predictive ability of simple ADC values without the need to incorporate second or higher-order radiomic features in the analysis. The above findings in prior reports demonstrate the predictive performance of PSMA PET and ADC MR images individually. Hence, we hypothesize that the high performance of our MLH model is due to the fact that it combines both [68Ga]Ga-PSMA-11 PET and ADC MRI features in one model scheme.
Further to the above findings, we also established patient biochemical recurrence (MBCR) and overall patient risk (MOPR) models. In order to provide an in vivo score per patient in lieu of biopsy grades in these models, we created a CLH score which weighted each MLH score per lesion with its respective volume in each patient. Since volume was identified as non-predictive to classify low-vs-high risk in prostate lesions (AUC 0.53), we assumed that the volume effect [40] in our high-ranking features was negligible, and thus, lesion volume was an independent value from our lesion MLH scores. This assumption allowed us to utilize volume as a weight factor for each lesion MLH score to compose the patient-specific CLH score. The resulting CLH score in combination with PSA and clinical stage values yielded high-performing MBCR and MOPR models (0.90 and 0.94 cross-validation AUCs, respectively). We assume that the accuracy increases of +20% and +21% in our MBCR and MOPR models compared to standard risk estimation are due to the following reasons: first, the clinical standard utilizes Gleason patterns from biopsy to describe lesion pattern risks in the prostate [41]. Biopsy is considered imperfect as it may not be able to describe the overall heterogeneity of the prostate lesions [19,42]. In contrast, our CLH score could characterize whole prostate lesions in vivo. Second, the clinical standard categorizes the PSA, the Gleason, and the clinical stage values independently into three categories (low, medium, and high risk). In contrast, we incorporated PSA, clinical stage, and the CLH score without re-binning them, thus avoiding potential information loss. Third, the clinical standard score acts as a maximum filter across its pre-binned risk categories to estimate overall risk to the patient. In contrast, the random forest ensemble logic in our MBCR and MOPR models could describe more complex relationships among PSA, clinical stage, and our in vivo CLH score. Our results demonstrate that such relationships may indeed be present and that building on those relationships may lead to in vivo risk predictive models in prostate cancer patients with the potential to eliminate the need for biopsy sampling in the future.
This study had a number of limitations. First, it built on a single-center cohort; however, due to utilizing a pre-generated MC fold scheme for all training and validation processes, no training and validation samples were mixed between the lesion and patient predictors. In addition, the utilized data preparation (redundancy reduction, feature ranking, and class imbalance correction) as well as training (mixed ensemble) and validation (1000-fold CV, sham data analysis) approaches minimized the chances of false discoveries. Second, due to the dual-tracer study design from which our images were taken, the [68Ga]Ga-PSMA-11 scans were not entirely free of [18F]FMC uptake remnants. Nevertheless, [18F]FMC can be regarded as an irreversible tracer [43] and, thus, the [18F]FMC uptake in terms of tissue to lesion ratio is expected not to change until the [68Ga]Ga-PSMA-11 examination. Last, only patients with proven prostate cancer were included after radical prostatectomy. Nevertheless, this selection criterion was necessary to acquire stable ground truth for lesion labeling.

Conclusions
This study demonstrates the feasibility of [68Ga]Ga-PSMA-11 PET/MRI in combination with radiomics and machine learning to non-invasively deliver both lesion characterization and risk prediction equally to preoperative invasive biopsy in patients with primary prostate cancer. Prospective multicentric studies are required to investigate the reproducibility and clinical utility of this approach.
Data availability The datasets generated and/or analyzed during the current study are available from the corresponding author on reasonable request.

Compliance with ethical standards
Conflict of interest The authors declare that they have no conflict of interest.
Ethics approval All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards. The study was approved by the Ethik Kommission der Medizinischen Universität Wien (EK 1985/2014).
Consent to participate Informed consent was obtained from all individual participants included in the study.

Consent for publication NA
Code availability Available from the corresponding author on reasonable request.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.