Introduction

Radiomics derives objective and quantifiable imaging biomarkers from medical images to provide insights beyond subjective and qualitative image analysis [1, 2]. This pixel-level image analysis approach extracts additional in-depth features that are invisible to the naked eye, creates new possibilities, and promises to foster big data trends in healthcare [3, 4]. Radiomics has shown potential in tumor classification [5], response prediction [6], and risk stratification [7] in oncologic imaging. The potential of radiomics has also been demonstrated in non-oncologic diseases, such as coronary plaque [8], pancreatitis [9], pneumonia [10], Crohn’s disease [11], and kidney stones [12]. The number of academic papers on radiomics research is large and still increasing [13, 14]. However, radiomics analysis has not been widely implemented in clinical routine [15,16,17,18], since it is currently not supported by adequate scientific evidence.

One of the most significant challenges of radiomics analysis is the lack of robustness [19,20,21,22,23,24]. Acquisition and reconstruction parameters, including scan system, radiation dose level, voxel size, reconstruction algorithm, and reconstruction kernel, have been demonstrated to affect the robustness of CT radiomics features in conventional single-energy CT (SECT) systems [25,26,27]. Dual-energy CT (DECT) systems have introduced more potential influencing factors on radiomics features, as heterogeneous techniques are used to realize DECT scans, such as dual-source dual-energy CT (dsDECT), rapid kV-switching dual-energy CT (rsDECT), dual-layer dual-energy CT (dlDECT), sequential scanning dual-energy CT (ssDECT), and split filter dual-energy CT systems [28]. Radiomics features were reproducible neither between conventional CT and DECT scans nor across different DECT techniques [29,30,31,32,33], even though the parameters were carefully adjusted. Photon-counting detector CT (PCD-CT) systems are believed to allow high feature stability and better characterization of disease [34,35,36,37,38,39], since they directly convert photons into electric pulses without the intermediate step of visible light [40]. Nevertheless, PCD-CT may lead to extra differences in radiomics features due to its higher image resolution compared with traditional energy-integrating CT systems [34,35,36]. As PCD-CT systems are not yet widely available, it is important to determine whether images generated using different CT techniques are consistent enough for radiomics analysis.

Therefore, this study aimed to evaluate the robustness of radiomics features on texture phantom scans using one PCD-CT system and four DECT systems.

Materials and methods

The workflow of the study is presented in Fig. 1. Institutional ethics approval and written informed consent were not required since this was a phantom study.

Fig. 1

Study workflow. This study consisted of three steps: image acquisition, image processing, and statistical analysis. A homemade texture phantom was scanned on one PCD-CT system and four types of DECT systems, at three dose levels of 5, 10, and 20 mGy. The raw data were reconstructed into VMIs at 70 keV. Pyradiomics was employed to extract 18 first-order and 75 texture radiomics features from ROIs segmented with a rigid registration. Test-retest repeatability between repeated scans was assessed by Bland-Altman analysis. The intra-system reproducibility between dose levels, and inter-system reproducibility within the same dose level, were evaluated by ICC and CCC. Inter-system variability among five scanners was estimated by CV and QCD

Phantom

We established a texture phantom consisting of twenty-eight different materials, as shown in Fig. 2. There were five wood blocks and twenty-three bottles filled with different materials [25, 41]. Each wood block was a cuboid measuring 150 mm × 55 mm × 45 mm. The cuboid part of each bottle measured 130 mm × 55 mm × 45 mm and was filled with material as tightly as possible. The materials were selected to provide varying textures. They were positioned to avoid beam-hardening artifacts and were kept unchanged throughout all the scans in the study. The details of the set-up of the texture phantom are presented in Supplementary Note S1.

Fig. 2

Phantom construction and image segmentation. A The inserts of the homemade phantom were made of wood blocks and bottles filled with different materials. B The inserts were placed in a foam plastic box and kept stable across all scans. The inserts used in the current study were: (1) air, (2) mesoporous sponge, (3) iodine-free salt, (4) granulated sugar, (5) flour, (6) iodized salt, (7) coarse-pore sponge, (8) nutritive soil for succulent plants, (9) rosewood, (10) chicken wing wood, (11) beechwood, (12) zebra wood, (13) basswood, (14) sand, (15) microporous sponge, (16) coix seed, (17) buckwheat, (18) sago, (19) cat litter, (20) oat, (21) sawdust, (22) soybean, (23) red bean, (24) mung bean, (25) rice, (26) quinoa, (27) millet, and (28) chia seed. C CT image of a representative axial slice of each material in the phantom. D A total of twenty-eight regions of interest were manually contoured on the reference scan and then copied to all other scans with a rigid registration

Image acquisition and reconstruction

The phantom was scanned on five CT scanners at two centers, including one PCD-CT system (NAEOTOM Alpha, Siemens Healthineers) and four dual-energy CT systems (dsDECT, SOMATOM Force, Siemens Healthineers; rsDECT, Revolution CT Apex, GE Healthcare; dlDECT, Hawk spectral CT, Philips Healthcare; and ssDECT, Aquilion ONE, Canon Medical Systems). The comparable acquisition and reconstruction parameters for the scans are presented in Table 1. Each scan was repeated several minutes apart to allow scan-rescan repeatability assessment. The field of view (500 mm × 500 mm), reconstruction matrix (512 × 512), and slice thickness (5 mm) were kept unchanged to maintain a constant voxel size. The tube current, rotation time, and pitch were adjusted to achieve volume CT dose index values of 5, 10, and 20 mGy. The tube voltage, iterative reconstruction method, and reconstruction kernel were selected to represent a typical abdominopelvic examination. All images were reconstructed into virtual monoenergetic images (VMIs) at an energy level of 70 keV on vendor-specific workstations relying on comparable linear energy blending approaches. The 70 keV level was chosen because it is used as the clinical standard of reference at our institution and has been suggested to be comparable to conventional images [42,43,44].

Table 1 CT acquisition and reconstruction parameters

Segmentation and feature extraction

The images were exported in Digital Imaging and Communications in Medicine (DICOM) format and then converted to Neuroimaging Informatics Technology Initiative (NIfTI) format using MRIcroGL version 1.2.20220720b (https://www.nitrc.org/frs/?group_id=889). The images were loaded into ITK-SNAP version 4.0.2 (http://www.itksnap.org/pmwiki/pmwiki.php) for segmentation by a radiologist with 5 years of experience in radiology and radiomics phantom research [30,31,32,33]. Twenty-eight regions of interest (ROIs), each 35 pixels (approximately 34 mm) in diameter, were placed at the center of each wood block or bottle following a rigid registration to avoid unexpected variations [25]. The ROIs were placed on one reference scan and then copied to the other scans. Each ROI was copied to the five contiguous middle slices of each wood block or bottle for radiomics feature extraction. We did not perform any pre-processing steps before feature extraction. Python version 3.12.1 (https://www.python.org) with the PyRadiomics package version 3.0.1 (https://pyradiomics.readthedocs.io/en/latest/) was used to extract 18 first-order features and 75 texture features, namely 24 gray-level co-occurrence matrix (GLCM), 14 gray-level run length matrix (GLRLM), 16 gray-level zone length matrix (GLZLM), 16 gray-level dependence matrix (GLDM), and 5 neighborhood gray-tone difference matrix (NGTDM) features [45]. The 26 shape features were not included since the ROIs were fixed in this study. The settings of feature extraction and the calculated features are presented in Supplementary Note S2.
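The extraction step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the configuration used in the study (which is given in Supplementary Note S2): the settings `force2D=True` and `label=1`, the helper names, and the file paths passed to `extract_features` are hypothetical. PyRadiomics names its zone-based texture class `glszm`, which is assumed here to correspond to the GLZLM features listed above. The two geometry helpers simply reproduce the pixel spacing and ROI diameter implied by the stated field of view and matrix.

```python
# Sketch only: class names follow PyRadiomics; the settings are assumptions,
# not the study configuration (see Supplementary Note S2 for the real one).

FOV_MM = 500.0        # field of view stated in the text
MATRIX = 512          # reconstruction matrix stated in the text
ROI_DIAMETER_PX = 35  # ROI diameter stated in the text

# First-order plus five texture classes; shape features are left out.
FEATURE_CLASSES = ("firstorder", "glcm", "glrlm", "glszm", "gldm", "ngtdm")


def pixel_spacing_mm(fov_mm: float = FOV_MM, matrix: int = MATRIX) -> float:
    """In-plane pixel spacing implied by the field of view and matrix."""
    return fov_mm / matrix


def roi_diameter_mm(diameter_px: int = ROI_DIAMETER_PX) -> float:
    """Physical ROI diameter for a given diameter in pixels."""
    return diameter_px * pixel_spacing_mm()


def extract_features(image_path: str, mask_path: str) -> dict:
    """Extract first-order and texture features for one image/mask pair."""
    # Imported lazily so the geometry helpers run without PyRadiomics installed.
    from radiomics import featureextractor

    # Hypothetical settings: slice-wise (2D) texture matrices, mask label 1.
    extractor = featureextractor.RadiomicsFeatureExtractor(force2D=True, label=1)
    extractor.disableAllFeatures()
    for name in FEATURE_CLASSES:
        extractor.enableFeatureClassByName(name)
    result = extractor.execute(image_path, mask_path)
    # Drop the "diagnostics_*" entries and keep only the feature values.
    return {k: v for k, v in result.items() if not k.startswith("diagnostics_")}


print(f"pixel spacing: {pixel_spacing_mm():.4f} mm")
print(f"ROI diameter:  {roi_diameter_mm():.1f} mm")
```

With the stated geometry, the pixel spacing is 500/512 ≈ 0.977 mm, so a 35-pixel ROI spans about 34.2 mm, consistent with the approximate diameter quoted above.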

Radiomics robustness analysis

The test-retest repeatability was assessed using the middle five slices of images from two repeated scans with unchanged acquisition and reconstruction parameters on the same system. The intra-system reproducibility between different dose levels was evaluated between images from 5 vs. 10 mGy, 5 vs. 20 mGy, and 10 vs. 20 mGy scans within the same system. The inter-system reproducibility was calculated between each pair of the five CT systems, separately at each of the three dose levels of 5, 10, and 20 mGy. The inter-system variability at each of the three dose levels was estimated across the five systems for each of the twenty-eight materials. The robustness of radiomics features was also analyzed according to five feature types. The signal-to-noise ratio of each scan was calculated.
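The comparison design described above can be enumerated explicitly. The following sketch, in which the variable names are illustrative only, lists every intra-system dose pair and every inter-system scanner pair, which makes the number of pairwise comparisons behind the reproducibility analysis easy to check.

```python
from itertools import combinations

SYSTEMS = ("PCD-CT", "dsDECT", "rsDECT", "dlDECT", "ssDECT")
DOSES_MGY = (5, 10, 20)

# Intra-system: every pair of dose levels within each system.
intra_system = [
    (system, dose_a, dose_b)
    for system in SYSTEMS
    for dose_a, dose_b in combinations(DOSES_MGY, 2)
]

# Inter-system: every pair of systems at each matched dose level.
inter_system = [
    (dose, sys_a, sys_b)
    for dose in DOSES_MGY
    for sys_a, sys_b in combinations(SYSTEMS, 2)
]

print(len(intra_system), "intra-system comparisons")
print(len(inter_system), "inter-system comparisons")
```

Three dose pairs on each of five systems give 15 intra-system comparisons, and ten scanner pairs at each of three dose levels give 30 inter-system comparisons.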

Statistical analysis

The statistical analysis was performed using R language version 4.1.3 (https://www.r-project.org/) within RStudio version 1.4.1106 (https://posit.co/). The mean relative change of the radiomics features across the different datasets was calculated. The test-retest repeatability was assessed using Bland-Altman analysis with a cutoff of 90% [46, 47]. The intra-system reproducibility between different dose levels, and inter-system reproducibility within the same dose level, were evaluated by intraclass correlation coefficient (ICC) of two-way mixed effects, single rater, absolute agreement type [48] and concordance correlation coefficient (CCC) [49]. The inter-system variability among the five systems was assessed by coefficient of variation (CV) [50] and quartile coefficient of dispersion (QCD) [51]. The ICC and CCC values were interpreted as follows: poor, < 0.50; moderate, 0.50–0.75; good, 0.75–0.90; or excellent, ≥ 0.90, while the CV and QCD values were interpreted as follows: acceptable, < 10%; moderate but still adequate, 11%–20%; and too high and inadequate, ≥ 20% [52].
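For readers who wish to reproduce the agreement metrics, the sketch below computes Lin's CCC and a two-way, single-measurement, absolute-agreement ICC from their standard definitions. It is written in Python for illustration, whereas the analysis in this study was performed in R; the function names are hypothetical, and the use of population (1/n) moments in the CCC follows Lin's original definition rather than anything stated in the text.

```python
from statistics import fmean


def ccc(x, y):
    """Lin's concordance correlation coefficient for paired measurements."""
    n = len(x)
    mx, my = fmean(x), fmean(y)
    # Population (1/n) moments, as in Lin's original definition (assumption).
    sxx = sum((a - mx) ** 2 for a in x) / n
    syy = sum((b - my) ** 2 for b in y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return 2 * sxy / (sxx + syy + (mx - my) ** 2)


def icc_a1(rows):
    """Two-way, single-measurement, absolute-agreement ICC.

    rows: one list per subject (e.g. per ROI), one value per condition.
    """
    n, k = len(rows), len(rows[0])
    grand = fmean(v for row in rows for v in row)
    row_means = [fmean(row) for row in rows]
    col_means = [fmean(row[j] for row in rows) for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in rows for v in row)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)              # between-subject mean square
    msc = ss_cols / (k - 1)              # between-condition mean square
    mse = ss_err / ((n - 1) * (k - 1))   # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def interpret(value):
    """Map an ICC or CCC value onto the four bands used in the text."""
    if value < 0.50:
        return "poor"
    if value < 0.75:
        return "moderate"
    if value < 0.90:
        return "good"
    return "excellent"


reference = [1.0, 2.0, 3.0, 4.0, 5.0]
shifted = [v + 1.0 for v in reference]  # same pattern, constant offset
pairs = [[a, b] for a, b in zip(reference, shifted)]
print(f"CCC = {ccc(reference, shifted):.3f} ({interpret(ccc(reference, shifted))})")
print(f"ICC = {icc_a1(pairs):.3f} ({interpret(icc_a1(pairs))})")
```

In this toy example the two series are perfectly correlated, yet the constant offset lowers the CCC to 0.80 and the ICC to about 0.83. Both metrics penalize systematic shifts between scanners, which a plain correlation coefficient would not.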

Results

Test-retest repeatability of radiomics features

The percentage of repeatable features ranged from 82.8 to 100.0%, and the overall percentage ± standard deviation of repeatable features was 97.1 ± 6.2%, according to Bland-Altman analysis (Supplementary Table S1 and Supplementary Fig. S1). The signal-to-noise ratio of each scan (Supplementary Table S2) and the mean relative change of the radiomics features in reference to PCD-CT (Supplementary Table S3) were calculated. The results were also summarized according to five feature types (Supplementary Tables S4 to S7).
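A per-feature repeatability decision based on Bland-Altman analysis could be sketched as follows. The text does not spell out how the 90% cutoff is applied, so the rule used here is an assumption made purely for illustration: a feature is counted as repeatable when at least 90% of the paired differences lie within the 95% limits of agreement. The function names and sample values are hypothetical.

```python
from statistics import fmean, stdev


def bland_altman(scan1, scan2):
    """Return bias and 95% limits of agreement for paired measurements."""
    diffs = [a - b for a, b in zip(scan1, scan2)]
    bias = fmean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


def is_repeatable(scan1, scan2, cutoff=0.90):
    """Assumed rule: at least `cutoff` of the differences lie within the limits."""
    _, lower, upper = bland_altman(scan1, scan2)
    diffs = [a - b for a, b in zip(scan1, scan2)]
    inside = sum(lower <= d <= upper for d in diffs)
    return inside / len(diffs) >= cutoff


# Hypothetical values of one feature over ten ROIs, scan vs. rescan.
scan = [10.0, 12.0, 11.0, 13.0, 9.0, 10.5, 12.5, 11.5, 9.5, 10.0]
rescan = [9.9, 12.1, 10.8, 13.2, 9.0, 10.4, 12.6, 11.5, 9.4, 10.1]

bias, lower, upper = bland_altman(scan, rescan)
print(f"bias = {bias:.3f}, limits of agreement = [{lower:.3f}, {upper:.3f}]")
print("repeatable:", is_repeatable(scan, rescan))
```

Applying such a rule to every feature and counting the proportion flagged as repeatable would yield a percentage of the kind reported above, although the exact criterion used in the study may differ.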

Intra-system reproducibility among three dose levels

The overall mean ± standard deviation ICC and CCC values for intra-system reproducibility were 0.945 ± 0.079 and 0.945 ± 0.079, respectively (Table 2), and the percentages of features with ICC > 0.90 and CCC > 0.90 were 86.0% and 85.7%, respectively (Fig. 3). The mean ± standard deviation ICC and CCC values of the five CT systems ranged from 0.916 ± 0.112 to 0.978 ± 0.041, and from 0.915 ± 0.112 to 0.977 ± 0.041, respectively. The percentages of features with ICC > 0.90 and CCC > 0.90 ranged from 76.3% to 95.0%, and from 76.3% to 95.0%, respectively. The results for each feature were summarized (Supplementary Fig. S2).

Table 2 Intra-system reproducibility among three dose levels
Fig. 3

Percentage of robust radiomics features according to intra-system reproducibility. The percentage of robust features in terms of intra-system reproducibility among three dose levels of (A) 5 vs. 10 mGy, (B) 5 vs. 20 mGy, and (C) 10 vs. 20 mGy, according to ICC and CCC values. The ICC and CCC values were interpreted as follows: poor, < 0.50; moderate, 0.50–0.75; good, 0.75–0.90; or excellent, ≥ 0.90

Inter-system reproducibility within the same dose level

The overall mean ± standard deviation ICC and CCC values for inter-system reproducibility were 0.157 ± 0.174 and 0.157 ± 0.174, respectively (Table 3). None of the features had ICC > 0.90 or CCC > 0.90, while 92.6% and 92.7% of features had ICC < 0.50 and CCC < 0.50, respectively (Fig. 4). Only the comparison between the dsDECT and rsDECT systems showed features with good agreement, with 8.2% of features having an ICC of 0.75–0.90 and 8.2% having a CCC of 0.75–0.90. The results for each feature were summarized (Supplementary Fig. S3).

Table 3 Inter-system reproducibility within the same dose level
Fig. 4

Percentage of robust radiomics features according to inter-system reproducibility. The percentage of robust features in terms of inter-system reproducibility among five scanners within the same dose level of (A) 5 mGy, (B) 10 mGy, and (C) 20 mGy, according to ICC and CCC values. The ICC and CCC values were interpreted as follows: poor, < 0.50; moderate, 0.50–0.75; good, 0.75–0.90; or excellent, ≥ 0.90

Inter-system variability among five scanners

The overall mean ± standard deviation CV and QCD values for inter-system variability were 88.8 ± 478.3% and 91.8 ± 2797.5%, respectively (Table 4 and Supplementary Fig. S4). The percentages of features with CV < 10% and QCD < 10% were 6.5% and 12.8%, respectively (Fig. 5). The inter-system variability was heterogeneous among different materials, with mean ± standard deviation CV values ranging from 44.0 ± 42.1% to 437.6 ± 344.2%, and mean ± standard deviation QCD values ranging from 25.6 ± 21.5% to 641.6 ± 182.2%. The percentages of features with CV < 10% and QCD < 10% ranged from 3.2% to 15.1%, and from 4.3% to 35.5%, respectively.
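The two dispersion metrics can be illustrated with a short Python sketch. The function names, the use of the sample standard deviation, the "inclusive" quartile method, and the feature values are all assumptions made for illustration, and the analysis in this study was performed in R. The second example suggests one possible reason why the standard deviations reported above can greatly exceed the means: both ratios grow very large when a feature takes values close to zero.

```python
from statistics import fmean, quantiles, stdev


def cv_percent(values):
    """Coefficient of variation: standard deviation over |mean|, in percent."""
    # Undefined when the mean is exactly zero.
    return 100.0 * stdev(values) / abs(fmean(values))


def qcd_percent(values):
    """Quartile coefficient of dispersion: (Q3 - Q1) / |Q3 + Q1|, in percent."""
    # Undefined when Q1 + Q3 is exactly zero.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return 100.0 * (q3 - q1) / abs(q3 + q1)


def interpret(percent):
    """Bands from the text; 10-11% is treated as moderate (assumption)."""
    if percent < 10:
        return "acceptable"
    if percent < 20:
        return "moderate but still adequate"
    return "too high and inadequate"


# Hypothetical values of one feature on the five systems.
stable = [100.0, 110.0, 90.0, 105.0, 95.0]
near_zero = [-1.0, 0.5, 1.0, -0.4, 0.1]  # feature whose mean is close to zero

for name, values in (("stable", stable), ("near zero", near_zero)):
    cv, qcd = cv_percent(values), qcd_percent(values)
    print(f"{name}: CV = {cv:.1f}% ({interpret(cv)}), QCD = {qcd:.1f}%")
```

For the first set of values the CV is about 7.9% and the QCD is 5.0%, both within the acceptable band, whereas the second set yields values far above 100% even though the absolute spread is small.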

Table 4 Inter-system variability among five scanners according to materials
Fig. 5

Percentage of robust radiomics features according to inter-system variability. The percentage of robust features in terms of inter-system variability among five scanners within the same dose level of A 5 mGy, B 10 mGy, and C 20 mGy, according to CV and QCD values. The CV and QCD values were interpreted as follows: acceptable, < 10%; moderate but still adequate, 11%–20%; and too high and inadequate, ≥ 20%

Discussion

Our study showed that the repeatability of radiomics features was heterogeneous among CT techniques, with PCD-CT, dsDECT, and rsDECT showing relatively higher repeatability. In contrast, differences in radiation dose level had less impact on the radiomics features. Notably, the radiomics features derived from images using different CT techniques were not reproducible across techniques and showed significant variability in feature values, despite the use of carefully adjusted protocols.

The DECT technique has a great impact on the robustness of radiomics features. A phantom study showed that different DECT techniques led to variability of radiomics features even though comparable parameters were used [30]. Deep learning image reconstruction algorithms could not harmonize the variability of radiomics features due to different DECT techniques [31]. However, deep learning image reconstruction algorithms showed potential for minimizing the radiomics variability related to differences in radiation dose level [33]. One opportunity for improving radiomics robustness across DECT systems is synchronizing the energy levels of VMIs to reach similar CT number values [32], although the potential influence of this approach on diagnostic performance has not yet been estimated. Further studies in patients supported the phantom results that the reproducibility of radiomic features across different DECT systems was low, but the robust radiomics features were not reflected in the phantom experiment using the same parameters [29]. These studies compared dsDECT, rsDECT, and dlDECT systems, while our study further strengthened the results by including the ssDECT system. We used acquisition and reconstruction parameters that were as comparable as possible between the systems because we aimed to focus on the differences that are specific to DECT systems. Therefore, the main sources of the low inter-system reproducibility were considered to be the multi-energy acquisition or material decomposition techniques [29]. It is evident that these techniques have a significant impact on the quantification of iodine [53,54,55]. Further, our study applied a phantom with materials of heterogeneous texture to validate the results. In summary, radiomics features were hardly reproducible across DECT systems, whether the phantoms were homogeneous, heterogeneous, or imitating physiological organ parenchyma [29, 41].

Image noise can be an important source of variability of radiomics features among CT systems. Previous studies showed that PCD-CT systems can provide high radiomics feature repeatability [37, 39] and high reproducibility between different radiation dose levels [34, 37], but heterogeneous reproducibility between VMIs at different keV levels [38]. Our study supported the high repeatability of PCD-CT and its high reproducibility between different radiation dose levels. These results were in accordance with those in organic phantoms, in which the repeatability after repositioning and the reproducibility between different tube currents were high [37]. This is reasonable since PCD-CT allows improved visualization and quantification even in ultra-low-dose imaging and in obese patients [56], as it can remove the electronic background noise [57]. It is theoretically beneficial for radiomics analysis, since this pixel-level image analysis approach is sensitive to slight differences in images [25, 30]. In contrast, the repeatability in DECT systems can be suboptimal depending on the material decomposition technique [58, 59]. However, the dose levels in our study were not low enough for electronic background noise to challenge some of the DECT systems. Therefore, the comparably high repeatability can be partially attributed to the relatively high radiation dose used in the study, in addition to the technique itself. Moreover, the same dose level does not guarantee the same signal-to-noise ratio, as different DECT systems rely on very different technologies. However, it may not be possible to obtain the same signal-to-noise ratio among different CT systems with comparable acquisition parameters. Our study found that first-order radiomics features had less variation than texture features, which is in accordance with previous phantom studies [30,31,32,33]. This may be because texture features can be changed by image quality differences among CT systems.
In addition, the pre-processing parameters also have a great impact on radiomics analysis. Although the influence of these parameters is beyond the scope of this work, we believe the selection of the bin size is especially important in our study. The selection of bin size can significantly influence the image noise and thereby the radiomics features in DECT systems [30]. However, images from PCD-CT systems are less likely to be influenced by the bin size, since PCD-CT systems are capable of removing the electronic background noise [57]. This should be investigated in future studies to improve the reproducibility between different CT systems [60, 61].

It is believed that radiomics features may benefit from the higher spatial resolution, higher contrast-to-noise ratio, and improved detection of lower-energy photons of PCD-CT systems for better pathology characterization [34,35,36]. A phantom study of pulmonary nodules indicated that the estimation of morphological features may be better in PCD-CT than in conventional CT systems [36], since the higher resolution of the PCD-CT system allows better delineation of the nodule. Another organic phantom study showed that differences of more than fifty percent were identified in 13 out of 14 selected radiomics features between PCD-CT and dsDECT systems [34]. On the other hand, a patient study comparing radiomics features of non-scarred left ventricular myocardium suggested that first-order features were nearly comparable between PCD-CT and dsDECT systems, but texture features changed strongly [35]. Our study compared PCD-CT with four DECT systems and extended these results, showing that radiomics features cannot be expected to be comparable between PCD-CT and DECT systems. We considered that the repeatability and reproducibility of radiomics features may not be substantially changed by some acquisition parameters, such as radiation dose level [25, 27, 34, 37]. In contrast, acquisition and reconstruction parameters such as material decomposition technique, spatial resolution, and reconstruction kernel have a substantial impact on radiomics feature values and result in a great decrease in the robustness of radiomics features [26, 27, 34, 35, 38, 39]. It is necessary to investigate the influence of reconstruction parameters on radiomics features within the PCD-CT system. However, it remains unknown whether phantom experiments can reflect the reproducibility of radiomics features in clinical patient scans with various protocols.
The agreement between phantom and patient experiments in the context of radiomics may be limited due to the difference in texture and the use of contrast media [38].

The following limitations of our study should be addressed. First, this was a phantom study without validation on human data. The texture of phantoms may differ from that of physiological human parenchyma or pathological tissues [29, 41, 62]. Further, it is still unclear whether the low reproducibility will impair the representation of biological phenomena in a clinical study. The low reproducibility may not be important if the change in radiomics features between different biological phenomena is large enough. Therefore, the results may not be directly transferable to patients. An improved organic phantom model with a specific disease would be preferable. However, our study gives an important insight into the variation of radiomics features derived from images using heterogeneous CT techniques, as patient data of multiple scans on different CT systems are not always available [29]. Second, we did not include DECT systems using the split filter technique, and only one PCD-CT system was included in our study. These issues should be addressed in further studies to show whether PCD-CT systems allow more stable radiomics features than traditional energy-integrating CT systems [34,35,36] and whether radiomics analysis among PCD-CT systems is more generalizable. Third, we did not include traditional SECT systems in our study. This may reduce the translational value of our work. The DECT-like 70 keV VMI is recommended by the vendor for clinical use in abdominal scans instead of SECT-like low-energy-threshold polychromatic images [30,31,32,33]. Therefore, the 70 keV VMI from the PCD-CT was selected as the reference in our study. Accordingly, we chose the 70 keV VMI from DECT systems for comparison. In a future study, we will compare the SECT-like low-energy-threshold polychromatic images with traditional SECT images in terms of radiomics features. Fourth, the lowest radiation dose level in this study was relatively high.
We did not test the stability of radiomics features at ultra-low radiation dose levels to demonstrate the advantage of PCD-CT systems in providing stable quantification without disruption from electronic background noise [56]. It is expected that PCD-CT systems allow reliable quantification within a wider range of radiation dose levels. Fifth, only VMIs at 70 keV were compared in this study. Both PCD-CT and DECT systems allow post-processing with material decomposition and linear blending to generate VMIs at different keV levels, iodine maps, and virtual unenhanced images [38, 39]. The robustness of radiomics features derived from these images is also of interest because different CT techniques have an influence on the quantification of iodine [53,54,55,56]. Sixth, we applied bi-dimensional ROIs instead of three-dimensional ROIs. This selection has a potential impact on reproducibility. However, the influence of using bi-dimensional or three-dimensional ROIs may be relatively small [63] and does not change the conclusion of the current study. Finally, this study did not evaluate the relationship between radiomics robustness and characterization ability. This should be considered in later clinical studies, as radiomics analysis based on PCD-CT systems may change the clinical interpretation or classification of pathologies with rich textures or small volumes [34, 37].

To conclude, this study outlined the variability of radiomics features derived from VMIs generated using one PCD-CT system and four traditional energy-integrating DECT systems, despite using comparable protocols. Different radiation dose levels did not substantially change radiomics features, while the repeatability of radiomics features was heterogeneous across CT techniques. Radiomics analysis based on one CT technique should not be directly transferred to others without validation. Future investigations are encouraged to mitigate radiomics variability due to CT techniques.