Background

Prostate cancer (PCa) is a common cancer in Europe and worldwide, with around 6,600 new cases diagnosed each year [1]. Radical prostatectomy is widely recognised as the standard surgical treatment for early-stage PCa. The detection of extracapsular extension (ECE) is fundamental for planning the surgical approach, because ECE is associated with higher rates of positive surgical margins, recurrence, and decreased survival [2,3,4,5,6,7]. Nomograms, such as D’Amico or CAPRA, are often used to predict the risk of advanced disease [8, 9]. Magnetic resonance imaging (MRI) has been shown to improve accuracy in predicting ECE, but there is high inter-reader variability in the interpretation of semantic MRI features, for which there is no consensus among authors [10,11,12]. A high-quality MRI acquisition protocol and highly experienced readers could improve the accuracy of MRI [10,11,12].

Radiomics extracts quantitative features from medical images using data-characterisation algorithms, which may improve diagnostic performance in PCa as well as the reproducibility of MRI examinations. Artificial intelligence (AI) and machine learning (ML) can help bring radiomics into everyday practice. However, clinically accepted and validated algorithms have not yet been established [13,14,15].

This systematic review aims to summarise evidence on using radiomics algorithms to predict pathological extracapsular extension (pECE) in PCa patients to aid surgical planning and improve outcomes.

Methods

This systematic review follows the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines; the protocol was registered with PROSPERO (CRD42020215671) and published in BMJ Open [16].

Eligibility criteria

This review considered manuscripts involving adult PCa patients who had a presurgical prostate biopsy indicating a Gleason score of 6 or greater and who underwent MRI before surgery. Only studies using 1.5-T or 3-T MRI scanners in patients with no prior treatment were included.

The primary outcome was pathologic local staging after surgery, with the goal of identifying imaging and clinical predictors of extracapsular extension on pathology specimens (pECE). The eligible studies were required to be retrospective or prospective cohort studies or randomised controlled trials that included prognostic factor analysis. Furthermore, these studies needed to have been published in peer-reviewed journals.

Studies were included if:

  • Information regarding PCa MRI staging and pathological PCa staging was available in the published report.

  • MRI images and radiomics signatures were used to detect pECE after prostatectomy.

Studies were excluded if:

  • The AI/ML predictive models were built with a different main predictive endpoint, such as localisation, segmentation, recurrence, prognosis, lymph node metastasis, or characterisation of PCa, without reference to the pathological PCa staging endpoint. Papers in which the authors built several models with different endpoints were included, provided one of the endpoints was pathological PCa staging.

  • Studies based only on MRI image characteristics or interpretative semantic MRI features, or whose combined signatures did not incorporate extracted radiomics features, were excluded.

  • Cross-sectional studies, case series, case reports, case-control studies, systematic reviews, conference proceedings, and master’s or PhD theses were excluded.

Search strategy

We conducted a comprehensive search across the electronic databases CINAHL, EMBASE, CENTRAL (Cochrane Central Register of Controlled Trials via Wiley Online Library), PubMed, and Web of Science Core Collection and, for grey literature, OpenGrey and the Grey Literature Network Service. Furthermore, we manually searched the reference lists of all included studies and of previously published systematic reviews of MRI staging of PCa.

The search strategy was developed by a medical librarian with expertise in systematic reviews. The search terms were customised to the specific requirements of each database. Keywords (“Prostate neoplasm”, “Machine learning”, “Artificial intelligence”, “Radiomics”, “Deep Learning”, “Staging” and “Magnetic Resonance Imaging”) or subject headings specific to each database (e.g., MeSH) were used along with the Boolean operators ‘OR’ and ‘AND’ to combine the search terms effectively. The search strategy is detailed in the published protocol [16].
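For illustration only, a hypothetical PubMed-style combination of these concept groups is sketched below; the actual database-specific strings are those detailed in the published protocol [16], and the exact term variants shown here are assumptions.

```python
# Illustrative only: a hypothetical PubMed-style Boolean query combining
# keyword groups with OR within a concept and AND between concepts.
# The real database-specific strings are those in the published protocol [16].
concepts = {
    "population": ['"Prostate neoplasm"', '"Prostate cancer"'],
    "method": ['"Radiomics"', '"Machine learning"', '"Artificial intelligence"', '"Deep learning"'],
    "imaging": ['"Magnetic Resonance Imaging"', '"MRI"'],
    "outcome": ['"Staging"'],
}

# OR within each concept group, AND between concept groups
query = " AND ".join("(" + " OR ".join(terms) + ")" for terms in concepts.values())
print(query)
```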

Study selection

No restrictions were applied other than language: only studies published in English were included. Each database was searched from January 2007 to October 2023.

Following title/abstract screening, the full texts of potentially relevant studies were evaluated. In cases where a consensus was not reached between the two reviewers (A.G. and H.W.), a third reviewer (M.K.) was consulted. Additionally, the reference lists of the studies chosen for inclusion were examined for any other relevant studies. The data collection process is illustrated in Fig. 1.

Fig. 1

PRISMA flow diagram of the study selection

Data collection process

Data extraction: Two reviewers (A.G. and M.K.) independently extracted the following data from the included studies. In cases of disagreement between the two reviewers, a consensus was reached through discussion; if necessary, two additional expert reviewers (M.O. and H.W.) were consulted. The extracted data were broadly categorised into patient and study characteristics, radiologist details, type of feature extraction (agnostic, i.e., extracted by computational algorithms, or semantic, i.e., interpreted by a radiologist), model characteristics, and predictive performance. Sensitivity, specificity, and the area under the receiver operating characteristic curve (AUC) were extracted for the training and validation groups, with 95% confidence intervals where available. The radiomics and integrated models were compared, and the best predictive performances were recorded.

Risk of bias applicability

The risk of bias in individual studies was assessed by three reviewers (A.G., H.W., and M.K.). Since we included diverse types of studies, we used different tools to assess the risk of bias depending on the characteristics of the studies. Data from these studies were extracted, tabulated, and then reviewed for risk of bias and applicability using the Quality Assessment of Diagnostic Accuracy Studies version 2 (QUADAS-2) tool [17]. This tool covers four sources of bias: (1) patient selection, (2) index test, (3) reference standard, and (4) flow and timing. For each domain, the risk of bias was rated as high, unclear, or low, depending on the information provided by the study. The review authors used the QUADAS-2 signalling questions to judge the risk of bias: if all signalling questions for a domain were answered ‘yes’, the risk of bias was considered low; if any signalling question was answered ‘no’, this flagged the potential for bias. The ‘unclear’ category was used only when insufficient data were reported to permit a judgement.
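The domain-level rule described above can be expressed as a simple decision function; the sketch below merely restates that rule, with a hypothetical ‘yes’/‘no’/‘unclear’ coding of the signalling-question answers.

```python
def judge_domain(signalling_answers):
    """Map the QUADAS-2 signalling-question answers for one domain to a
    risk-of-bias judgement, following the rule described above.

    signalling_answers: list of 'yes' / 'no' / 'unclear' strings (hypothetical coding).
    """
    if all(answer == "yes" for answer in signalling_answers):
        return "low"              # all 'yes' -> low risk of bias
    if any(answer == "no" for answer in signalling_answers):
        return "potential bias"   # any 'no' flags potential bias (reviewers then judge high vs. unclear)
    return "unclear"              # insufficient data reported to permit a judgement


# Example: one question could not be answered and none was 'no' -> 'unclear'
print(judge_domain(["yes", "unclear", "yes"]))
```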

Because QUADAS-2 does not always accommodate the niche terminology encountered in AI studies, we also applied the radiomics quality score (RQS) proposed by Lambin et al in this systematic review [17, 18]. Studies with a high risk of bias and low applicability were excluded. A narrative synthesis was conducted, acknowledging the risk of bias and the strength and consistency of significant associations.

Synthesis of results

Due to differences in AI system applications, study designs, algorithms, patient cohorts, evaluation strategies, and performance metrics, a narrative synthesis was chosen instead of a meta-analysis. Meta-analysis is not recommended for diagnostic test accuracy studies with substantial differences in patient cohorts and test settings, as it would produce biased results.

Results

Study characteristics (Table 1)

The eleven included studies (0.9% of the 1,247 screened papers) were published between 2019 and 2023, used a retrospective design, and were mainly from China (eight), with two from Italy and one from Norway. All the studies described a model based on radiomics-extracted features, either alone [19,20,21,22], combined with clinical features [23,24,25,26,27], or in an integrated model combining semantic interpretative features, agnostic radiomics features, and clinical features to predict ECE on histopathological specimen analysis [28, 29]. All but three studies used a 3-T field strength, and the total number of patients included in the models ranged from 62 to 284. Only two studies [23, 30] were performed in more than one institution. Lesion segmentation and feature interpretation were undertaken by more than one radiologist, and inter-reader agreement was evaluated in all studies except Losnegård et al [28]. A recent study [29] compared three individual models (radiomics, clinical, and a semantic model based on the assessment of ECE on MRI by four radiologists) with a combined model including all relevant features from the three models. Three studies also addressed other endpoints, such as positive surgical margins [24, 26], lymph node metastases, and tumour aggressiveness [19, 26]. One study also built a radiomics model to predict prognostic biological biomarkers of PCa [24].

Table 1 General characteristics of the studies

Radiomics characteristics (Table 2 and Table-S1)

Model performance

All the reviewed studies focused on developing radiomics models based on agnostic features extracted from T2-weighted imaging (T2WI) and apparent diffusion coefficient (ADC) maps of the manually segmented tumoural region. Dynamic contrast-enhanced (DCE) images were also used in four publications [20, 21, 24, 28]. The studies compared different signatures composed of imaging features (IFs) extracted from T2WI and ADC maps independently and from both modalities together, and tested them for predicting the presence vs. absence of pECE.

Table 2 Radiomics characteristics

For each MRI sequence, shape features (size and sphericity) and texture features (GLCM, GLRLM, GLSZM, NGLDM) were the most common discriminative features, and in the majority of cases these were extracted from T2WI. The exception was Fan et al [24], where the most relevant feature was extracted from DCE. Radiomics features derived from histograms were less relevant than the features mentioned above. The coefficients used to calculate the selected radiomics features differed between studies, and the authors did not find a common, stable radiomics feature that could be the dominant impact factor for pECE across them. The image processing and feature selection methods were very heterogeneous between studies: MATLAB, original PyRadiomics, and Laplacian-of-Gaussian (LoG) and wavelet filters were used for feature extraction. Most researchers compared radiomics models with clinical and combined (radiomics + clinical features) models; in these cases, the combined models achieved the best performance (AUC 0.92 [25], 0.72 [26], 0.95 [24], 0.72 [23], 0.76 [27], and 0.89 [29]).
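As a minimal sketch of this kind of agnostic feature extraction (assuming PyRadiomics, with hypothetical file paths and filter settings rather than the configuration of any specific reviewed study), shape and texture features can be computed from a T2WI or ADC volume and its manually segmented tumour mask as follows; PyRadiomics’ ‘gldm’ class is used here as the closest available equivalent of the NGLDM features named above.

```python
# Sketch of agnostic feature extraction with PyRadiomics. File paths, filter
# settings and feature-class choices are illustrative assumptions, not the
# configuration of any specific reviewed study.
from radiomics import featureextractor

extractor = featureextractor.RadiomicsFeatureExtractor()

# Original images plus Laplacian-of-Gaussian and wavelet filtered images,
# as several of the reviewed studies describe (sigma values are assumed).
extractor.enableImageTypes(Original={}, LoG={"sigma": [1.0, 3.0]}, Wavelet={})

# Shape and texture feature classes; 'gldm' stands in for the NGLDM-type features.
extractor.disableAllFeatures()
for feature_class in ("shape", "glcm", "glrlm", "glszm", "gldm"):
    extractor.enableFeatureClassByName(feature_class)

# Hypothetical paths to a T2WI volume and its manually segmented tumour mask.
features_t2 = extractor.execute("patient001_T2W.nii.gz", "patient001_tumour_mask.nii.gz")
print({k: v for k, v in features_t2.items() if not k.startswith("diagnostics")})
```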

The highest AUC of a model using only radiomics features (tumoural region) to predict ECE was 0.93 in the training group and 0.85 in the validation group [24]. This was followed by Xu et al [25] (AUC 0.91), Yang et al [29] (AUC 0.86), and Cuocolo et al [22] (AUC 0.83 in the training set and 0.80/0.73 in two external validation sets).

Ma et al [20, 21] built a radiomics signature in the peritumoural region (capsule and periprostatic fat) and compared it with the radiologists’ interpretation. Pairwise comparisons showed that the radiomics signature was more accurate than the radiologists’ interpretation: its accuracies (90% and 88% in the training and validation groups, respectively) were much higher than the performance achieved directly by the radiologists (AUCs of 0.685–0.755 in the training cohort and 0.600–0.697 in the validation cohort). This finding is consistent with the study by Yang et al [29], in which the radiomics signature outperformed the radiologists’ interpretation (AUCs of 0.88 and 0.835 vs. 0.746 and 0.774 in the training and validation groups, respectively).

Bai et al compared intratumoural and peritumoural (PT) single radiomics signatures and achieved the best predictive value (AUC 0.70) with the PT signature extracted from the ADC map. In this study, the PT region was derived automatically through 3D dilatation.
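As an illustration of how such a peritumoural ring can be obtained (a sketch using SimpleITK ≥ 2.0 with an assumed 3-mm margin and hypothetical file names; the exact implementation of Bai et al is not described beyond “3D dilatation”):

```python
# Sketch: deriving a peritumoural ring from a binary tumour mask (0 = background,
# 1 = tumour) by 3D dilation. The 3-mm margin and file names are assumptions for
# illustration, not values taken from the study.
import SimpleITK as sitk

tumour_mask = sitk.ReadImage("patient001_tumour_mask.nii.gz")  # mask drawn on the ADC map

# Convert the desired physical margin (mm) into a per-axis voxel radius.
margin_mm = 3.0
radius_voxels = [max(1, int(round(margin_mm / spacing))) for spacing in tumour_mask.GetSpacing()]

dilated = sitk.BinaryDilate(tumour_mask, kernelRadius=radius_voxels)

# Peritumoural region = dilated mask minus the original tumour mask.
peritumoural = sitk.And(dilated, sitk.Not(tumour_mask))
sitk.WriteImage(peritumoural, "patient001_peritumoural_mask.nii.gz")
```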

Only two groups of authors [28, 29] built a combined model with clinical and semantic interpretative MRI features using Mehralivand’s proposed EPE-grade criteria [31]. They compared the radiologists’ interpretation and the Memorial Sloan Kettering Cancer Center (MSKCC) nomogram with the radiomics signature and the combined models. The AUC of the radiologists’ interpretative model was similar in the training groups of both studies (0.74) [28, 29] and was 0.77 in the validation group of Yang et al [29]. The AUCs of the radiomics models were 0.75 [28] and 0.88 [29], respectively.

The combination of radiomics, radiological and clinical interpretation performed significantly better (AUC 0.89; p < 0.05) than the clinical model (AUC 0.74) and the semantic model (AUC 0.77), but not significantly better (p = 0.167) than radiomics alone (AUC 0.835) [29].
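For readers wishing to reproduce this kind of pairwise comparison on their own data, a simple paired-bootstrap sketch is given below; it is an assumption-laden illustration (hypothetical labels and model scores) and not necessarily the test used by the original authors (a DeLong-type test may have been applied).

```python
# Sketch: paired bootstrap comparison of two models' AUCs on the same validation
# set. y_true, scores_a and scores_b are hypothetical inputs.
import numpy as np
from sklearn.metrics import roc_auc_score


def bootstrap_auc_difference(y_true, scores_a, scores_b, n_boot=2000, seed=0):
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    scores_a, scores_b = np.asarray(scores_a), np.asarray(scores_b)
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))   # resample cases with replacement
        if len(np.unique(y_true[idx])) < 2:               # need both classes in the resample
            continue
        diffs.append(roc_auc_score(y_true[idx], scores_a[idx]) -
                     roc_auc_score(y_true[idx], scores_b[idx]))
    diffs = np.array(diffs)
    # approximate two-sided bootstrap p-value for the null of equal AUCs
    p_value = min(1.0, 2 * min((diffs <= 0).mean(), (diffs >= 0).mean()))
    return diffs.mean(), p_value
```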

Risk of bias assessment (Table 3)

The review authors used the QUADAS-2 and RQS methods to judge the risk of bias.

Table 3 Risk of bias: QUADAS-2 and radiomics quality scores

QUADAS-2

Patient selection: Only the study by Damascelli et al [19] was deemed to have a high risk of bias, as the case selection process was unclear due to insufficient description.

The studies by Cuocolo et al [22], Losnegård et al [28] and Yang Liu [27] were considered unclear because they did not mention whether patients who had received any type of treatment before radical prostatectomy were excluded.

Index test (MRI images): All patients underwent an adequate and identical institutional MRI protocol. Manual segmentation of the lesions was reproducible in all studies except one [28], in which only one radiologist undertook the lesion segmentation; one study [26] did not exclude poor-quality images.

Reference standard: The risk of bias for the reference standard (presence of ECE in the specimen) was low in all of the included studies.

Flow and timing: Except for four studies [19, 23, 24, 27], which did not mention the time between MRI and prostatectomy, all the included studies consistently used appropriate reference standards for patients and maintained appropriate intervals between MRI and histopathology.

Radiomics quality score

The study by Cuocolo et al [22] had the highest RQS, with 20 points, while Damascelli et al [19], Losnegård et al [28] and Yang Liu [27] had the lowest scores of 10, 9 and 8 points, respectively, mainly because of the absence of model validation. No study was prospective, presented a phantom study on all scanners, or performed imaging analyses at multiple time points. Only one study provided open-science data [22]. No study performed a cost-effectiveness analysis or a biological correlation.

Discussion

This systematic review found eleven studies that aimed to predict pECE in PCa patients using radiomics signatures. Most of these studies had limited sample sizes and used data from a single centre, and four used a single MR scanner, which restricts the generalisability of their models. All the models used textural feature extraction, but the most significant textural features varied among the studies; the majority of the significant features were extracted from T2WI.

The study by Damascelli et al had a high risk of bias, did not perform external validation, and included only 62 patients, a sample size not considered adequate for robust conclusions [19]. Cuocolo et al achieved an accuracy of 83% in the training group using only ROIs of intraprostatic lesions to predict pECE [22].

Ma et al conducted two complementary studies, comparing the radiomics model built in the first study [20] with a semantic interpretative model (MRI EPE grade) in the second [21]. They found that the radiomics model achieved higher accuracy than the radiologists, as described in the results. The low accuracy of the radiologists may be due to the difficulty of determining macroscopic ECE involvement from limited visual interpretive findings. The radiomics model has a low risk of bias, as assessed by the QUADAS-2 tool and the RQS scale, but its MATLAB feature-generation approach is not open source and uses non-standard techniques, making it difficult to replicate. The study performed an internal training and validation split of 2:1 but did not specify which dataset was used for feature selection, which may have affected the results. The model has not been independently evaluated at another institution, and further studies are needed to validate its performance. Nevertheless, this study suggests that it is possible to use peritumoural regions to create radiomics signatures for predicting ECE [21].

In the remaining studies [23,24,25,26,27], clinical features were used alongside the radiomics features to construct combined models. The dominant clinical predictors were serum PSA and the Gleason score (GS) of the biopsy. These studies reported better results for combined models with clinical and radiomics variables than for models using radiomics features alone, with AUC values between 0.72 and 0.95, a moderate risk of bias in the QUADAS-2 evaluation, and RQS between 16 and 18 points. Of these, the study by Fan et al [24] had the best accuracy, with textural features derived from the ROI drawn on DWI. However, it was a single-institution study using two different scanners and only an internal validation dataset.
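A minimal sketch of such a combined clinical-plus-radiomics model is given below, assuming a hypothetical table of extracted radiomics features joined with PSA and biopsy GS and a logistic-regression classifier evaluated by AUC; the reviewed studies used a variety of classifiers and validation schemes.

```python
# Minimal sketch of a combined clinical + radiomics model (logistic regression,
# AUC on a held-out split). The CSV file and column names are assumptions.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("radiomics_plus_clinical.csv")           # hypothetical extracted-feature table
clinical = ["psa", "biopsy_gleason_score"]                 # dominant clinical predictors reported
radiomic = [c for c in df.columns if c.startswith(("original_", "wavelet_", "log_"))]

X = df[clinical + radiomic]
y = df["pECE"]                                             # 1 = ECE present on pathology

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
model.fit(X_train, y_train)

print("validation AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```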

Two studies [28, 29] created a combined model that associated semantic features from the radiologists’ interpretation with the radiomics model and clinical features. In Losnegård et al [28], the AUC of the radiomics model was almost the same as that of the semantic interpretative model (0.75 vs. 0.74, respectively); however, the model was developed in a limited way, without a validation group, and the risk of bias was very high (high on QUADAS-2 and only 9 points on the RQS). In the other, more recent study [29], the authors showed that the combined model achieved the best AUC in the validation group compared with the other models; however, external cohorts (from different institutions) are needed to validate the robustness of the radiomics and combined models for detecting ECE.

All the studies used one or more feature selection strategies to reduce overfitting. However, the use of different feature sets in different studies led to a lack of consistency in the features retained in the final models, as previously mentioned, precluding any attempt to analyse the relevant radiomics features for predicting pECE synergistically across studies.
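As one common example of such a strategy (not the specific method of any single study), a LASSO-penalised logistic regression can be used to retain only features with non-zero coefficients; the sketch below reuses the hypothetical feature table from the previous example.

```python
# Sketch of a typical feature-selection step: LASSO-penalised logistic regression
# keeps only features with non-zero coefficients. Inputs (X_train, y_train,
# clinical, radiomic) reuse the hypothetical table from the previous sketch;
# individual studies also used other strategies (e.g., mRMR, recursive elimination).
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

selector = make_pipeline(
    StandardScaler(),
    SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear", C=0.1)),
)
X_reduced = selector.fit_transform(X_train, y_train)

kept = np.array(clinical + radiomic)[selector.named_steps["selectfrommodel"].get_support()]
print(f"{X_reduced.shape[1]} features retained:", list(kept))
```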

Future radiomics studies must ensure high-quality data collection and standardisation of radiomics features across different institutions and imaging protocols. The IBSI (Image Biomarker Standardisation Initiative) seeks to provide image biomarker nomenclature and definitions, benchmark datasets, and benchmark values to verify image processing and image biomarker calculations, as well as reporting guidelines, for high-throughput image analysis.

By addressing these concerns, future radiomics studies can enhance the reliability and clinical utility of radiomics signatures in detecting ECE in patients with PCa before surgery.

This review had certain limitations. Firstly, although our search strategy was comprehensive, relevant studies may have been published between the end of our search period and the publication of this review. Secondly, this systematic review focused only on radiomics signatures and did not analyse other AI methods, semantic interpretative scores, or nomograms for detecting ECE.

Finally, this review would have benefited from a quantitative synthesis or meta-analysis of the included articles, but this was unfortunately not possible, as key statistical data and the dominant features were not reported in this small sample of studies.

Conclusion

Non-imaging biomarkers such as PSA and GS have shown promise in predicting ECE in PCa, and radiomics signatures combined with MRI data could enhance the accuracy of ECE prediction. However, the current evidence is not robust enough to support the clinical use of radiomics signatures for ECE detection before surgery. Future radiomics studies need prospective testing in multicentre settings with large datasets, including external validation cohorts, to enhance reliability and clinical utility in detecting ECE.