Introduction

Multiparametric magnetic resonance imaging (mpMRI) is an important modality in standard of care for prostate cancer (PCa) thanks to its excellent soft-tissue contrast, spatial resolution, and simultaneous acquisition of multiple parameters [1]. The combination of multiple sequences in mpMRI has improved tumor detection and characterization in PCa management pathways and enhanced staging accuracy [2]. While the diagnostic performance of T2-weighted imaging (T2WI) alone is inadequate compared to mpMRI, T2WI remains key for lesion analysis [1,2,3], due to its high-resolution anatomical information of the prostate [4] and its role in Prostate Imaging Reporting and Data System (PI-RADS) [5]. Nevertheless, tumor region ambiguity and variations in signal intensity (SI) present challenges for T2WI [4].

Radiomics, high-throughput computational analysis of radiological imaging, has recently gained attention in PCa research through imaging biomarkers that potentially add value to PI-RADS [6, 7]. Significant associations of radiomics features with pathophysiological processes in several clinical utility studies have highlighted their potential for PCa diagnosis, risk stratification, prognosis, and predicting response to treatment [8,9,10,11]. Quality assessment and standardization of radiomics features are important to ensure stability and clinical relevance of potential biomarkers [6, 12]. Image pre-processing is recommended in the radiomics workflow to circumvent acquisition susceptibilities, standardize image quality, and ensure reproducibility and validity of radiomics features [6, 12, 13]. This has driven utilization of reproducibility, robustness, predictive power, reliability, and stability measurements as important aspects for further radiomics studies [14,15,16,17]. With reproducible radiomic features, the analysis of disease characteristics, staging disease progression, and tracking treatment response across different protocols become more reliable. This is particularly beneficial for monitoring progression in Active Surveillance [6, 18]. Although clinical utility studies [8,9,10,11] show strong pathophysiological associations between radiomics features and PCa, to our knowledge, no study has specifically investigated the impact of PCa characteristics on radiomics feature reproducibility.

The aim of this study is to evaluate the reproducibility of radiomics features derived with different pre-processing settings from two T2WI scans of prostate lesions acquired at two time points a short interval apart, and to evaluate the impact of PCa characteristics on the reproducibility of radiomics features.

Materials and methods

The overall radiomics workflow of the study is shown in Fig. 1. In this study, radiomics features were extracted from suspected lesions on two T2WI examinations using 48 different pre-processing settings separately. Feature reproducibility was measured, and the pre-processing setting with the best reproducibility was selected. Subsequently, the impact of clinical variables (i.e., PCa characteristics) on feature reproducibility was evaluated. In addition, the association between feature values and clinical variables was evaluated.

Fig. 1

The workflow of the study. Regions of interest (ROIs) were first manually delineated on images from the two T2-weighted imaging (T2WI) examinations (t1 and t2). Subsequently, 107 radiomics features were extracted from the ROIs after the images were separately pre-processed using 48 different settings (combinations of pre-processing parameters). The extracted features were then used to assess reproducibility

Patient cohort

All patients gave informed consent for the study, which was approved by the institutional review board and the Regional Committee for Medical and Health Research Ethics (REC Central Norway, identifiers 2013/1869 and 2017/576).

A total of 53 patients with histologically confirmed PCa from a previous prospective study [19] were selected for this study. Patients were examined between March 2015 and December 2017 due to suspicion of PCa. Each patient had two consecutive mpMRI examinations (median interval = 5 days, range: 0–16 days). The first examination was a detection scan according to PI-RADS v2 guidelines [5] and the second was used to guide in-bore biopsy sampling. Note that patients did not undergo any therapy or receive treatment between the two examinations. Patients with PI-RADS < 3 (n = 3) were excluded, and the remaining 50 patients (median age = 66, range: 48–75 years) with a total of 74 suspicious lesions (PI-RADS ≥ 3) were included.

Clinical variables including prostate-specific antigen density (PSAD), prostate volume, PI-RADS score, and the International Society of Urological Pathology (ISUP) score [20] were collected for each patient. Details of the patient cohort are provided in Table 1.

Table 1 Details of the patient cohort included in the study

MRI acquisitions

Axial T2WI were scanned with a 3T MRI system (MAGNETOM Skyra, Siemens Healthineers, Erlangen, Germany) using a turbo spin-echo sequence. A summary of the acquisition parameters is provided in Table 2.

Table 2 A summary of the T2W MRI sequence parameters

Pre-processing settings

In this study, we investigated 48 pre-processing settings (Table S1 of Supplementary Information 1) resulting from all possible combinations of the following pre-processing parameters:

Gray-level discretization and binning

Image intensities were discretized to accommodate optimal extraction of radiomics features using two methods: relative discretization (Fixed Bin Number [FBN]) and absolute discretization (Fixed Bin Size [FBS]) [12, 21, 22]. Four binning values were investigated for each discretization method: 16, 32, 64, and 128 for FBN and 5, 10, 20, and 40 for FBS.
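The two discretization schemes can be illustrated with a minimal Python sketch that follows the IBSI-style definitions [12]. The function names, and the use of the ROI minimum as the lower bound for FBS, are assumptions made for illustration and do not reproduce the code used in this study.

```python
import math


def discretize_fbn(values, n_bins):
    """Fixed Bin Number: spread the ROI intensity range over n_bins bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Flat ROI: every voxel falls into the first bin.
        return [1 for _ in values]
    bins = []
    for v in values:
        b = math.floor(n_bins * (v - lo) / (hi - lo)) + 1
        bins.append(min(b, n_bins))  # the ROI maximum is assigned to the last bin
    return bins


def discretize_fbs(values, bin_size, lower=None):
    """Fixed Bin Size: bins of constant width, counted from a lower bound.

    Assumption: the ROI minimum is used as lower bound when none is given.
    """
    lo = min(values) if lower is None else lower
    return [math.floor((v - lo) / bin_size) + 1 for v in values]


# Hypothetical ROI intensities, and the same ROI with intensities doubled.
roi = [0, 10, 20, 30, 40]
roi_scaled = [2 * v for v in roi]

fbn_a = discretize_fbn(roi, 4)
fbn_b = discretize_fbn(roi_scaled, 4)
fbs_a = discretize_fbs(roi, 20)
fbs_b = discretize_fbs(roi_scaled, 20)
```

In this toy example, FBN assigns the same bins before and after rescaling the intensities, whereas FBS does not, which is one reason FBN is often preferred for images with arbitrary intensity units.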

SI normalization

Two pre-processing modes, AR and NAR, which respectively include and exclude signal intensity (SI) normalization in the workflow, were investigated. In the AR mode, AutoRef [23], an automated dual-reference tissue (fat and muscle) normalization method, was used.

Intensity outlier filtering

Intensity outlier filtering (i.e., dynamics filtering) was used as a range re-segmentation to include only region-of-interest (ROI) voxels within [μ ± 3σ], where μ denotes the mean and σ the standard deviation of intensity [24]. Three modes of intensity outlier filtering were investigated: no outlier filtering (NoF), limitation of dynamics filtering by re-setting the voxels outside of [μ ± 3σ] range to the upper or lower threshold value (IN), and limitation of dynamics filtering by excluding voxels outside the [μ ± 3σ] range from the mask (OUT).
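The three filtering modes can be sketched in Python as follows. This is an illustrative sketch only: the function name is hypothetical, and the use of the population standard deviation is an assumption, since the text does not state which estimator was used.

```python
import statistics


def outlier_filter(values, mode="NoF", k=3.0):
    """Apply intensity outlier filtering to a list of ROI voxel intensities.

    NoF: return the intensities unchanged.
    IN:  clip intensities outside [mu - k*sigma, mu + k*sigma] to the nearest bound.
    OUT: drop voxels outside that range (i.e., remove them from the mask).
    """
    if mode == "NoF":
        return list(values)
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # assumption: population standard deviation
    lo, hi = mu - k * sigma, mu + k * sigma
    if mode == "IN":
        return [min(max(v, lo), hi) for v in values]
    if mode == "OUT":
        return [v for v in values if lo <= v <= hi]
    raise ValueError(f"unknown filtering mode: {mode}")


# Hypothetical ROI: 20 voxels of similar intensity and one extreme outlier.
roi = [10] * 20 + [1000]
kept_all = outlier_filter(roi, "NoF")
clipped = outlier_filter(roi, "IN")
masked = outlier_filter(roi, "OUT")
```

The IN mode preserves the number of voxels in the ROI, whereas the OUT mode shrinks the mask, so the two modes can affect intensity- and texture-based features differently.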

All pre-processing was performed using Python (v3.7.9) except for SI normalization which was performed using Matlab R2020a (The MathWorks, Inc., USA).
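The 48 settings described in this section arise from 2 discretization methods × 4 binning values × 2 normalization modes × 3 filtering modes. The sketch below enumerates them in Python. The dictionary keys and the enumeration order are assumptions: the order was chosen because it is consistent with the two setting numbers reported in the Results (setting 18 and setting 37), but the authoritative numbering is the one in Table S1.

```python
from itertools import product

# Pre-processing parameters as described in the text.
DISCRETIZATION = {"FBN": (16, 32, 64, 128), "FBS": (5, 10, 20, 40)}
NORMALIZATION = ("AR", "NAR")
FILTERING = ("NoF", "IN", "OUT")


def build_settings():
    """Enumerate all combinations of the pre-processing parameters."""
    settings = []
    for method, bin_values in DISCRETIZATION.items():
        for bin_value, norm, filt in product(bin_values, NORMALIZATION, FILTERING):
            settings.append(
                {
                    "id": len(settings) + 1,  # assumed numbering, see lead-in
                    "discretization": method,
                    "bin": bin_value,
                    "normalization": norm,
                    "filtering": filt,
                }
            )
    return settings


settings = build_settings()
```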

Manual delineation

All 74 individual lesions were manually delineated on T2WI for both scans based on PI-RADS reports by a radiology resident (E.S.) with over 5 years of experience in examining PCa lesions at St. Olavs Hospital, Trondheim University Hospital, Trondheim, Norway using ITK-SNAP [25] (v3.6).

Feature extraction

Radiomics features were extracted from the ROIs (i.e., 74 individual lesions) using PyRadiomics (v3.0) [26] separately for each of the 48 pre-processing settings. In each setting, a total of 107 radiomics features were extracted from the following 7 feature groups: First-Order Statistics (FO, 18 features), Gray-Level-Co-occurrence Matrix (GLCM, 24 features), Gray-Level-Dependence-Matrix (GLDM, 14 features), Gray-Level-Run-Length-Matrix (GLRLM, 16 features), Gray-Level-Size-Zone-Matrix (GLSZM, 16 features), Neighboring-Gray-Tone-Difference-Matrix (NGTDM, 5 features), Shape in 3D (14 features). The feature extraction settings were set to default, except for the investigated pre-processing parameters. Details on the PyRadiomics default settings can be found in Table S2 of Supplementary Information 1.
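As an illustration of how a pre-processing setting might be translated into PyRadiomics extraction settings, the sketch below builds a settings dictionary using the PyRadiomics keys `binCount`, `binWidth`, `resegmentRange`, and `resegmentMode`. The mapping itself is an assumption and is not taken from the study's code: in particular, it assumes that AutoRef normalization and the IN (clipping) mode are applied to the image before extraction, and that the OUT mode corresponds to sigma-based re-segmentation.

```python
# Number of features per group, as listed in the text.
FEATURE_GROUPS = {
    "FO": 18, "GLCM": 24, "GLDM": 14, "GLRLM": 16,
    "GLSZM": 16, "NGTDM": 5, "Shape": 14,
}


def pyradiomics_settings(discretization, bin_value, filtering):
    """Translate one pre-processing setting into PyRadiomics-style settings.

    Hypothetical helper: all other extraction settings are left at their defaults.
    """
    settings = {}
    if discretization == "FBN":
        settings["binCount"] = bin_value   # fixed number of bins
    elif discretization == "FBS":
        settings["binWidth"] = bin_value   # fixed bin size
    else:
        raise ValueError(f"unknown discretization: {discretization}")
    if filtering == "OUT":
        # Exclude voxels outside [mu - 3*sigma, mu + 3*sigma] from the mask.
        settings["resegmentRange"] = [-3, 3]
        settings["resegmentMode"] = "sigma"
    return settings


selected = pyradiomics_settings("FBN", 64, "OUT")
n_features = sum(FEATURE_GROUPS.values())
```

Under these assumptions, the resulting dictionary could be passed as keyword arguments to `radiomics.featureextractor.RadiomicsFeatureExtractor`, whose `execute` method takes the image and the mask.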

Statistical analysis

The inter-scan reproducibility of each feature for each pre-processing setting was measured using all of the individual lesions (74 lesions) with the two-way random, single score intra-class correlation coefficient (ICC) [27]. Features with ICC > 0.75 were considered to have good reproducibility. The ICCs for each pre-processing setting were compared and the setting that yielded the highest number of features with good reproducibility was selected. In case of a tie, the setting with the lowest binning number was chosen.
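A minimal Python sketch of this procedure is given below. It assumes the two-way random, single-score ICC in its absolute-agreement form, ICC(2,1), computed from the mean squares of a lesions × scans table; the exact variant used in the study is the one defined in [27]. The selection helper and its toy input are hypothetical, and how binning values are compared between FBN and FBS in a tie is not specified in the text.

```python
def icc_2_1(scores):
    """ICC(2,1) for an n x k table (rows: lesions, columns: scans)."""
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    # Sums of squares of the two-way layout without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_err = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


def select_setting(settings, threshold=0.75):
    """Pick the setting with the most features above the ICC threshold.

    Ties are broken in favor of the lowest binning value.
    """
    def n_good(setting):
        return sum(icc > threshold for icc in setting["iccs"])

    return max(settings, key=lambda s: (n_good(s), -s["bin"]))


# Toy input with made-up ICCs, for illustration only.
toy_settings = [
    {"id": "A", "bin": 64, "iccs": [0.90, 0.80, 0.50]},
    {"id": "B", "bin": 128, "iccs": [0.90, 0.80, 0.50]},
    {"id": "C", "bin": 16, "iccs": [0.90, 0.30, 0.50]},
]
chosen = select_setting(toy_settings)
```

In the toy input, settings A and B both have two features above the threshold, so the tie-break selects A because of its lower binning value.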

Next, the selected pre-processing setting was used to assess the differences in ICCs, measured using index lesions (50 lesions), between clinical variable categories, which included PSAD (low [≤ 0.15 ng/mL²] vs. high [> 0.15 ng/mL²]), prostate volume (small [< 40 mL] vs. enlarged [≥ 40 mL]), PI-RADS scores (3, 4, and 5), and ISUP scores (< 1, 1, and > 1). Additionally, the selected pre-processing setting was used to evaluate the association between radiomics feature values (extracted from the index lesions of baseline scans) and clinical variable categories.

To assess the differences for two and multiple groups of ICCs or radiomic feature values, the two-tailed Mann–Whitney U and Kruskal–Wallis tests, respectively, were used. All tests were performed separately followed by Benjamini–Hochberg correction for multiple comparisons [28], and corrected p values < 0.05 were considered statistically significant.
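The Benjamini–Hochberg step can be sketched in plain Python as follows. The analyses in this study were run in Matlab, so this is only an illustration with a hypothetical function name; the raw p values would come from the Mann–Whitney U or Kruskal–Wallis tests (available in Python, for example, as `scipy.stats.mannwhitneyu` and `scipy.stats.kruskal`).

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the original order."""
    m = len(p_values)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up raw p values, for illustration only.
raw_p = [0.01, 0.04, 0.03, 0.005]
adj_p = benjamini_hochberg(raw_p)
significant = [p < 0.05 for p in adj_p]
```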

All statistical analyses were performed using Matlab R2022b (The MathWorks, Inc., USA).

Results

Reproducibility and pre-processing parameters

A heatmap depicting the overall reproducibility of all radiomics features extracted from all lesions with respect to the 48 pre-processing settings is presented in Fig. 2 (see Supplementary Information 2 for numerical table). The fluctuations in the ICC values indicate that the pre-processing settings have a substantial impact on the reproducibility of the radiomics features.

Fig. 2

Reproducibility heatmap of intra-class correlation coefficient (ICC) values showing the overall reproducibility of 5136 elements from 107 radiomics features with respect to 48 pre-processing settings. All the individual lesions (74 lesions) were used to extract the radiomics features

The mean ± standard deviation (SD) of all ICCs was 0.46 ± 0.27, with the highest ICC value of 0.97 obtained by Shape-MeshVolume in all settings and the lowest ICC value of −0.14 obtained by GLSZM-SmallAreaLowGrayLevelEmphasis in setting 37 (FBS20, AR, NoF). Overall, 18.7% of all features had good reproducibility (ICC > 0.75), including 16 features that were consistently reproducible across all pre-processing settings (listed in Table S3 of Supplementary Information 1). By feature group, Shape had the highest ICC (0.81 ± 0.14), followed by GLDM (0.49 ± 0.25), NGTDM (0.48 ± 0.27), FO (0.44 ± 0.25), GLSZM (0.44 ± 0.25), GLRLM (0.42 ± 0.26), and GLCM (0.31 ± 0.20).

A comparison of all ICCs across the different pre-processing parameters was performed to determine the impact of each parameter on overall feature reproducibility (Fig. 3) and on each feature group reproducibility (Fig. S1 of Supplementary Information 1). Overall, FBN (0.51 ± 0.25) had significantly higher reproducibility than FBS (0.41 ± 0.29). The reproducibility varied with the change of binning in gray-level discretization. In general, increasing the bin number in FBN improved the reproducibility of texture feature groups, while increasing the bin size in FBS improved reproducibility in GLDM and decreased reproducibility in NGTDM. SI normalization by AR resulted in significantly higher reproducibility (0.48 ± 0.26) than NAR (0.44 ± 0.28) overall and in FO, GLRLM, and GLSZM. Intensity outlier filtering showed no significant differences among NoF (0.46 ± 0.27), IN (0.47 ± 0.27), and OUT (0.47 ± 0.28) overall or among feature groups. Details of the ICCs across different pre-processing parameters can be found in Table S4 of Supplementary Information 1.

Fig. 3

Comparison between intra-class correlation coefficient (ICC) values across different pre-processing parameters of all settings. The impacts of gray-level discretization, binning values of Fixed Bin Number (FBN) and Fixed Bin Size (FBS), signal intensity normalization, and intensity outlier filtering on the reproducibility of feature groups are shown. Significant differences are marked with *

Selected pre-processing setting

Figure 4 shows the distribution of features with good reproducibility among the 48 pre-processing settings. Based on the selection criteria, the selected setting was setting 18 (FBN64, NAR, OUT) with 25 features with good reproducibility (Table 3).

Fig. 4

A stacked bar chart of features with good reproducibility across pre-processing configurations. Based on the selection criteria, setting 18 was selected

Table 3 The intra-class correlation coefficient (ICC) values of the features with good reproducibility for the selected pre-processing setting

Reproducibility and clinical variables

Figure 5 shows a comparison of feature reproducibility across the categories of each clinical variable, using the selected pre-processing setting.

Fig. 5

Comparison between intra-class correlation coefficient (ICC) values across different clinical variables categories of the selected pre-processing setting. The impacts of prostate-specific antigen density (PSAD), prostate volume, Prostate Imaging Reporting and Data System (PI-RADS) score, and the International Society of Urological Pathology (ISUP) score are shown. Only the index lesions (50 lesions) were used to extract the radiomics features

No significant differences in feature reproducibility were found between any of the clinical variable groups. However, radiomics feature reproducibility (mean ± SD of ICCs) was generally higher at low PSAD values (0.60 ± 0.30) than at high PSAD values (0.50 ± 0.28), comparable between small (0.53 ± 0.30) and enlarged (0.54 ± 0.27) prostate volumes, lower for PI-RADS 5 lesions (0.43 ± 0.25) than for PI-RADS 3 (0.48 ± 0.37) and PI-RADS 4 (0.52 ± 0.25) lesions, and higher for ISUP < 1 lesions (0.53 ± 0.34) than for ISUP 1 (0.44 ± 0.34) and ISUP > 1 lesions (0.47 ± 0.30). More detailed information on the comparison results at the feature group level can be found in Figure S2 and Table S5 of Supplementary Information 1.

Association between baseline feature values and clinical variables

Table 4 displays the association between clinical variables and the radiomics feature values of the 25 features with good reproducibility in the selected pre-processing setting, while Table S6 of Supplementary Information 1 shows the same association for the remaining 82 less reproducible features. Notably, among the features with good reproducibility, significant differences were observed among PI-RADS scores in 23 features, while for features with low reproducibility, significant differences were observed in 41 features. However, no significant differences in feature values were observed for PSAD, prostate volume, and ISUP score.

Table 4 The p values resulting from significance tests between the values of the radiomics features with good reproducibility from the selected pre-processing setting and the clinical variable categories

Discussion

Radiomics features have shown the potential to improve PCa diagnosis, risk stratification, prognosis, and prediction of response to treatment [8,9,10,11]. However, feature reproducibility plays a critical role in the development of stable radiomics models. To increase the robustness of radiomics models and improve their predictions, standardized image pre-processing is essential [14,15,16,17]. In this study, we, therefore, aimed to evaluate the reproducibility of radiomic features derived from two T2WI scans of the prostate acquired within a short time interval using different pre-processing settings (i.e., combinations of parameters). Our goal was also to determine the pre-processing setting that yielded the highest number of reproducible features. In addition, we evaluated the influence of disease characteristics (i.e., clinical variables) on radiomics feature reproducibility and tested the association between radiomics feature values and clinical variables.

The study focused on T2WI due to its important role in lesion analysis [1,2,3], as well as its prevalence in PCa radiomics studies [8,9,10,11].

The median time interval between our two consecutive T2WI examinations was 5 days. No changes in the prostate gland or lesions are expected in this short interval as PCa is typically slow-growing [29].

Three pre-processing parameters were investigated in this study: gray-level discretization with varied binning, SI normalization, and intensity outlier filtering. The parameters were selected based on the recommendations of the image biomarker standardization initiative (IBSI) [12].

Our results show a substantial influence of the pre-processing parameters on the reproducibility of the radiomics features. Gray-level discretization seems to have the strongest influence on reproducibility, with FBN discretization yielding significantly higher reproducibility than FBS. These results support the recommendations of IBSI [12] and van Timmeren et al. [22]. The superiority of FBN over FBS may be due to the normalization-like effect that FBN produces, which benefits images with arbitrary units [12] such as those of T2WI. On the other hand, the results contradict the findings of Duron et al. [21] and the recommendations of PyRadiomics [26], which were based on findings from Leijenaar et al. [30]. This contradiction could be because these two studies focused on different organs, binning values, and/or scanning modalities. This indicates that, for each clinical use case, careful selection of the optimal trade-off between discretization and binning value is still needed, particularly when other pre-processing parameters such as SI normalization and intensity outlier filtering are involved [12, 22, 31].

SI normalization was included in the study as it is a common pre-processing step when working with T2WI and due to its ability to alter the reproducibility of radiomics features [32]. The AutoRef normalization method was selected for our study, as it was shown to outperform other methods for normalization of T2WI [23, 33]. Our study indicated that including normalization in the workflow can increase feature reproducibility in most feature groups.

Intensity outlier filtering has been widely used for reliable texture assessment in MRI [22, 24, 34]. Adjusting outliers avoids the dependence of intensity on the shift of [μ ± 3σ], making it suitable for T2WI with arbitrary values [24]. However, similar to previous research [24, 34], our results showed no significant effect on reproducibility of this filtering.

The IBSI recommendations can help provide more standardized radiomics features. However, the selection of a pre-processing setting that yields the highest number of reproducible features remains dependent on the application and dataset. In this study, we selected a pre-processing protocol with a fixed bin number of 64, without SI normalization, and intensity outlier filtering by excluding voxels outside the [μ ± 3σ] range from the mask. This selection is rational from a pre-processing perspective. The bin number of 64 is an intermediate value among the most frequently used FBN bin numbers that is compatible with T2WI lesions, where the image details can still be well-preserved [22, 35]. Although SI normalization is beneficial in many pre-processing protocols, it was not required in this study. This is potentially due to our dataset being from a single center, which can be assumed to have more homogeneous image quality characteristics [36], as well as the fact that the combination of high binning value in FBN and intensity outlier filtering maintained sufficient reproducibility performance, leading to reproducible results for most of the settings.

The pre-processing protocol selected in this study resulted in 25 features that exhibited good reproducibility. Additionally, the study identified 16 features that consistently demonstrated reproducibility across all pre-processing settings. Incorporating the 25 features after applying the selected pre-processing protocol, or utilizing the 16 consistently reproducible features with alternative pre-processing protocols, may enhance the robustness of the radiomics-based models. However, further research is required to validate that.

Some of the ICCs reported in this study differed from those reported in other works [15, 16]. This could be due to differences in study design, dataset, acquisition settings, feature sensitivity, and software packages. The low percentage of features with good reproducibility suggests that radiomics features are highly sensitive; similar conclusions have been drawn in previous studies [15, 16]. However, the high reproducibility of all Shape features was expected, since the manual ROI delineations were performed by a single reader. Moreover, Sunoqrot et al. [37] showed in their work that Shape features maintain high reproducibility even when automated segmentation methods are applied on specific regions.

To our knowledge, no study has specifically examined the influence of PCa characteristics on the reproducibility of radiomics features. Therefore, we conducted this investigation to assess the impact of differences in disease characteristics on calculated radiomics features. Understanding the effect of disease characteristics is crucial because these factors may need to be considered when developing radiomics models.

In this study, no significant differences were found in the overall reproducibility of features across the various clinical variable categories. However, there was a trend toward higher reproducibility of features extracted from patients with less advanced or aggressive PCa (low PSAD, PI-RADS < 5, and ISUP < 1) compared to those with more advanced PCa, whereas reproducibility was comparable between small and enlarged prostate volumes. This trend could be attributed to the increased heterogeneity observed in more advanced or aggressive PCa, which likely influenced the calculated features. Although not statistically significant, the trend suggests a potential relationship between disease characteristics and feature reproducibility.

The association between radiomics feature values extracted from the baseline scan using the selected pre-processing setting and the clinical variables categories showed a significant difference only among PI-RADS scores. This finding suggests that the use of radiomics features for the classification of PI-RADS is promising, as shown by Brancato et al. [38], who demonstrated a high diagnostic efficacy of radiomics models in the classification of PI-RADS 3 findings. In contrast to the results of other studies [9, 10], this study showed no significant difference among ISUP scores. However, in this study, the sample sizes of ISUP categories were small (11, 8, and 31 cases, respectively, for ISUP < 1, ISUP 1, ISUP > 1), so no definite conclusion can be drawn.

Overall, the study demonstrated the significance of carefully selecting the pre-processing settings for radiomics features and considering the impact of disease characteristics on these features. By doing so, it is possible to develop more robust and reliable radiomics-based models that can be used to analyze disease characteristics and track treatment outcomes across different protocols. This is particularly important for monitoring disease progression in Active Surveillance [6, 18] where the reliability of such models is crucial.

Our study has some limitations. Our cohort was relatively small, the dataset was acquired at a single center, and the ROIs were delineated by only one radiologist. This might have led to a less generalized dataset compared to other multi-center radiomics studies.

Conclusions

We investigated the reproducibility of radiomics features derived with different pre-processing settings from two T2WI scans of the prostate acquired a short time interval apart. Our results show that pre-processing parameters influenced the reproducibility of radiomics features from T2WI. The most reproducible pre-processing setting included discretization with a fixed bin number of 64, without SI normalization, and intensity outlier filtering by excluding voxels outside the [μ ± 3σ] range from the mask. This setting resulted in 25 features with good reproducibility. Moreover, the results showed that disease characteristics (i.e., clinical variables) do not have a significant impact on radiomics feature reproducibility.