Introduction

Radiomics has now become an intensive field of research, based on the extraction and mining of several quantitative variables, which are referred to as radiomics features (RFs). RFs enable to screen extensively the shape and texture of objects of interests within medical images of any modality. In oncology, RFs have been used in predictive models based on machine-learning classifiers to discriminate benign and malignant lesions, identify molecular alterations in tumors, predict patients’ outcome, and even build radio-genomics signatures1,2,3. Regarding sarcomas, radiomics have improved predictions of grading, prognosis and response to chemotherapy/radiotherapy, based on CT-scans, structural MRI alone or combined with positron emission tomography, dynamic-contrast enhanced or diffusion MRI4,5,6,7,8,9.

Though one aim of radiomics is to provide an objective assessment of tumor phenotype, several studies have shown the influence of pre- and post-processing factors on the value of RFs10,11,12,13,14,15. These findings question the validity and reproducibility of inter-site radiomics studies. This issue is even more prominent with MRI because of the absence of standard intensity scale. Therefore, signal intensities (SIs) lack of comparability, even for a given sequence acquired on the same MR-scanner. Unlike gray-levels discretization or voxel-size standardization, technical details regarding homogeneization of SIs are frequently missing in materials and methods and, even when performed, assessment of the optimal setting for the MRI dataset of interest is often lacking.

Some intensity harmonization techniques (IHTs) have been proposed in the neuroimaging literature to enable robust analysis of structural and diffusion MRIs across different radiological centers and longitudinally, but most cannot be transposed to sarcomas because of the heterogeneity of tissues surrounding sarcomas, which are ubiquitous tumors. Available IHTs regarding non-brain MRI are scarce. The most frequently encountered are global scaling (e.g. where SIs values are centered by removing the mean and scaled to unit variance, or transformed to range between 0 and 1), ratio with SIs of a healthy tissue that is not affected by the disease (for instance adipose tissue or muscle in musculoskeletal imaging), or histogram-matching (HM, where the intensity histograms are transformed to match a reference intensity histogram)16,17,18. In addition, Orlhac et al. have recently shown that ComBat harmonization method, which was initially described in genomics to remove batch effect, could correct non biological differences related to the type of scanners19. Though the authors focused on CT-scanner, ComBat may help reduce unwanted variations in MRI-based radiomics datasets as well.

Thus, our aim was to investigate how the IHT could influence MRI-based radiomics analyses in a uniformly-treated cohort of soft-tissue sarcomas (STS) patients with which the presence of intra-tumor heterogeneity on initial T2-weighted-imaging (-WI) has been previously correlated with metastatic-relapse free survival (MFS)4,6,20. To do so, to comprehensively assess the impact of IHT on radiomics analyses, we investigated its influence on: (i) the RFs values; (ii) the prognostic value of radiomics-based unsupervised classifications; and (iii) the performances of supervised classifiers to predict early metastatic relapses.

Methods

Study population

This study was approved by the local Research Ethics Committee of Bergonié Insitute (Bordeaux, France) according to good clinical practices and applicable laws and regulations. All methods were performed in accordance with the relevant guidelines and regulations. The need for written informed consent was waived because of its retrospective nature.

Patients were consecutively recruited as they fulfilled the following inclusion criteria: newly-diagnosed, non-metastatic (according to chest CT-scan), histologically-proven high-grade STS of trunk wall or extremities (n = 163), treated with 4–6 cycles of anthracycline-based neoadjuvant chemotherapy and curative surgery at our sarcoma reference center from June 2006 to November 2016 (n = 133), available baseline MRI (n = 95) with axial spin-echo T2-WI without artefacts (n = 72), and available clinical and radiological follow-ups for at least 2 years after the surgery (n = 70). Follow-ups consisted in a clinical examination and chest radiograph every 3 months for 2 years, every 6 months for 5 years and annually until 10 years after surgery, which were complemented by chest CT-scans and MRIs in case of doubtful findings. All relapses were histopathologically confirmed. MFS was defined as the time since curative surgery to metastatic relapse.

MRI acquisition

The baseline MRI examinations were acquired on 3 different 1.5-T MR-systems (Philips Signa [17/70, 24.3%], Siemens MAGNETOM Aera [41/70, 58.5%], General Electrics Healthcare Optima Jem MR450w [12/70, 17.1%]) with adjustment of coils, field-of-view and matrix depending on tumor size, location and depth. Regarding T2-WI, the range of repetition and echo times were 2,400–4,500 ms and 70–130 ms, respectively. Slice thickness ranged from 3 to 5 mm. The protocol also systematically included 2D or 3D T1-WI after intra-venous gadolinium-chelates injection (with or without fat-suppression).

MRI post-processing (Fig. 1)

Figure 1
figure 1

Study pipeline. HM histogram matching, IHT intensity harmonization technique, No-IHT no use of IHT before extracting radiomics features, RF radiomics features, WI weighted imaging.

After anonymizing MRIs, the postprocessing was performed with R (version 3.5.3, Vienna, Austria) by using the “oro.nifti”, “ANTsR” and “extranstr” packages21.

First, T2-WIs were converted to nifti format. Voxel size resampling (with b-spline interpolator) and N4 bias field correction were applied to obtain a common spatial resolution of 1 × 1 × 4 mm3 and to correct non-uniform intensities22.

Second, a senior radiologist (A.C., with 4 years of experience in sarcoma imaging) manually segmented the whole tumor volume, slice-by-slice, by using LIFEx freeware (version 5.10, Inserm, Orsay, France, www.lifexsoft.org)23. The radiologist had access to all the other MRI sequences to adjust the boundary of the segmentation if needed. The volumes of interests were all validated by a second senior radiologist (M.K., with 28 years of experience in sarcoma imaging).

Third, 4 IHTs were applied in parallel to the whole imaging dataset in order to harmonize the SIs of the T2-WI, providing 4 harmonized datasets, i.e.:

  1. (1)

    IHTfat, which consisted in dividing all the SIs of a given T2-WI by the mean SI of adipose tissue on that T2-WI, as follows:

    $${\mathrm{SI}\left(\mathrm{x},\mathrm{y},\mathrm{z}\right)}_{\mathrm{IHT}-\mathrm{fat}}=\frac{\mathrm{ SI}(\mathrm{x},\mathrm{y},\mathrm{z}) }{\mathrm{mean}(\mathrm{SI}(\mathrm{adipose tissue}))}$$

    where x, y and z are the coordinates of a voxel. To do so, the first senior radiologist segmented a volume of at least 10 cm3 of pure normally-appearing adipose tissue on each T2-WI in order to extract the mean SI per patient.

  2. (2)

    IHTstd, which consisted in normalizing the SIs of a T2-WI according to the minimum and maximum of all voxels included in this T2-WI, as follows:

    $${\mathrm{SI}(\mathrm{x},\mathrm{y},\mathrm{z})}_{\mathrm{IHT}-\mathrm{std}}=\frac{\mathrm{SI}\left(\mathrm{x},\mathrm{y},\mathrm{z}\right)-\mathrm{min}(SIs)}{\mathrm{max}\left(SIs\right)-\mathrm{min}(SIs)}$$
  3. (3)

    IHTHM.1, which consisted in performing a matching of the intensity histogram of each T2-WI with the intensity histogram of a same normalized T2-WI from the same randomly chosen patient in the MRI dataset. This technique is achieved in 2 stages: first, a pre-specified number of percentiles and a reference image are given to the algorithm and, second, the new image is transformed according to several linear mapping of the SIs (depending on the number of landmarks) in order to match to the reference image (details about the conversion of SIs are given in Supplementary Data 1) (https://github.com/abdhigithub/hatch).

  4. (4)

    IHTHM.All, which consisted in performing a matching of the intensity histogram of each T2-WI with the average intensity histogram of the whole normalized MRI dataset.

IHTHM.All and IHTHM.1 were trained on 100 histogram landmarks as a compromise between postprocessing time and image quality but other numbers of landmarks were tried (Supplementary Data 1). The superimposed SIs distributions of the 70 patients depending on the IHT are given in Supplementary Data 2.

Radiomics features extraction

The tumor volumes were then propagated on the 4 post-processed imaging datasets (IHTfat, IHTstd, IHTHM.1 and IHTHM.All) and on the imaging dataset without IHT (named No-IHT) enabling the extraction of 5 datasets of 45 3-D RFs by using LIFEx software. SIs were previously discretized into 128 fixed bins. Thirteen histogram-based and 32 s-order texture features from grey-level co-occurrence matrix (GLCM, n = 7—with a 1-voxel distance to neighbors), grey-level run length matrix (GLRLM, n = 11), neighborhood grey-level different matrix (NGLDM, n = 3) and grey-level zone length matrix (GLZLM, n = 11) were calculated (details are giving in Supplementary Data 3).

ComBat compensation

We applied the ComBat-Harmonization function in R (https://github.com/fortin1/ComBatHarmonization) to the 45 RFs that were extracted from the IHTHM.All dataset with a non-parametric setting in order to remove unwanted noise due to technical variations between the 3 MR-systems of the study while preserving biological variability, and notably when there are only a few patients per site19,24,25. ComBat-Harmonization is classically applied at the end of the postprocessing pipeline, herein, after the extraction of RFs obtained with the IHT that was hypothesized to be the more relevant and realistic among the 5 IHTs (namely IHTHM.All). This data-driven method identifies the protocol effect assuming that the value of each feature, RF, measured in a volume-of-interest, (x,y,z), with an imaging protocol, i, can be written as: RFi,(x,y,z) = α + γi + δi × εi,(x,y,z) (in which α is the average value for features yij; γi is an additive protocol effect and δi is a multiplicative protocol effect affected by an error term εij). The compensations consists in estimating the model parameters α, γi and δi, and by using a maximum likelihood approach on the basis of the set of available observations:\({RF}_{i,v(x,y,z)}^{ComBat}=\widehat{\alpha }+ \frac{{RF}_{i,v(x,y,z)}-\widehat{\alpha }-{\widehat{\gamma }}_{i} }{{\widehat{\delta }}_{i}}\), in which \(\widehat{\alpha }\), \({\widehat{\gamma }}_{i}\) and \({\widehat{\delta }}_{i}\) are estimators of α, γi and δi. Parametric and non-parametric forms of ComBat-Harmonization have been developed. The non-parametric form does not assume law followed by the parameters and has been used in the present study.

The resulting RFs were labelled IHTHM.All.C. In total, six paired datasets of 45 RFs were obtained, namely: No-IHT, IHTfat, IHTstd, IHTHM.1, IHTHM.All and IHTHM.All.C.

Statistical analysis

Statistical analysis was performed with R. All tests were two-tailed. A p-value of less than 0.05 was deemed significant. A 3-steps approach was performed to evaluate the impact of IHTs on each aspect of radiomics studies (Fig. 1):

  1. (1)

    Per-RF analysis: RFs were all normalized in order to range between 0 and 1 and to facilitate direct comparisons. For each RF, the influence of the IHT was evaluated with one-way repeated-measures ANOVA. Post-Hoc comparisons were assessed with Tukey test and Bonferroni corrections. Intraclass correlation coefficients (ICC) were estimated for each RF, with a 2-way random model, agreement between raters and 6 raters (“irr” package).

  2. (2)

    Unsupervised analysis: A hierarchical clustering analysis with the Ward method was applied on each of the 6 subsets of RFs. RFs were centered and scaled by mean beforehand and the Euclidean distance between each pair of patients was computed. Visual inspection of the silhouette plot enabled to select 2 clusters of patients for each harmonization technique. We calculated the Baker’s gamma coefficient between each pair of dendrograms (dendextend” package), and the Kappa index between each pair of clustering results, which enabled the quantification of their divergence depending on the IHT26.

    The correlations between MFS and the clusters yielded by the models were assessed with Kaplan–Meier analysis and multivariable Cox models—after adjustment to the classical confounding covariables for sarcomas, i.e.: the longest baseline diameter (< vs. ≥ 10 cm), performance status (0 vs. 1–2), histological type (undifferentiated sarcomas vs. other), number of chemotherapy cycles (4 vs. 5–6), chemotherapy type (anthracycline-ifosfamide vs doxorubicine), adjuvant radiotherapy, surgical margins (R0 vs. R1-R2) and histological response (goods vs. poor responder to chemotherapy with a cut-off of 10% viable cells on post-chemotherapy surgical specimen). Prognostic performances of the 6 multivariate models were evaluated and compared through concordance-indices, which estimate the models’s ability to provide a reliable ranking of the survival times based on the individual risk scores.

  3. (3)

    Supervised analysis: The same supervised machine-learning approach was applied to the 6 datasets of RFs in order to predict the occurrence of a metastatic relapse within 2 years after curative surgery by using the “caret” and “glmnet” packages27,28. The total population of 70 patients with available clinical and radiological follow-up was randomly subdivided into one training cohort of 50 patients and one testing cohort of 20 patients with the same proportion of metastatic relapses by using the createDataPartition function. The training cohort was used to train a binomial logistic regression with combination of least absolute shrinkage and selection operator (LASSO) and ridge penalizations (elasticnet-LR). This algorithm consists of reducing the number and the importance of explanatory variables in order to optimize the performances of the classification model. The coefficients of the less contributive variables are shrunken towards 0 (: ridge regression) or even set to 0 (: LASSO). The amount of ridge and LASSO penalization was investigated by using a manual grid search with two hyperparamètres: α (mixing percentage) and λ (regularization parameter) and tenfold cross validation, repeated 5 times. The same partitioning of patients was used for the 6 datasets. The same clinical and pathological covariables as in the unsupervised analysis were included, in addition to the same 3 shape RFs (volume, compacity and sphericity—which are independent from the IHT).

The performances of supervised models were evaluated through cross-validated accuracy and area under the ROC curves (AUROC) with 95% confidence interval (95%CI). To do so, we extracted the 5 × 10 = 50 estimations of the accuracy and AUROC from the 50 distinct test sub-cohorts of 5 patients from the training cohort, and we applied the CI function from the Rmisc package to these vectors. Finally, for each RFs dataset, the final model with the highest AUROC in cross-validation was used on the testing cohort to estimate the AUROC and accuracy.

Results

Thirty-two of the 70 patients (45.7%) were women with a median age of 58 (range: 19–84) (Table 1). The most frequent histological types were high-grade undifferentiated sarcomas (31/70, 44.3%), with a median size of 116 mm (range 40–273) and mostly deep-seated in the lower limb (35/70, 50%).

Table 1 Clinical and pathological features of the study population.

Per-RF analysis

The influence of IHT was significant for all RFs (p-values range: < 0.0001–0.02, Supplementary Data 4). All significant differences in the RFs comparisons between each pair of post-processing techniques are listed in Table 2. The highest and lowest amounts of differences were obtained for post-hoc comparisons between IHTHM-All and IHTfat (31 statistically different RFs out of 45, 68.9%) and IHTHM.All and IHTHM.1 (6/45, 13.3%), respectively.

Table 2 Summary of the per-radiomics features (RFs) analysis.

Figure 2 shows the 45 ICCs in descending order. The highest ICCs were reached with GLRLM_RLMNU, GLRLM_GLNU and GLCM_Correlation (≥ 0.95). The lowest ICCs were reached with GLZLM_ZLNU, GLZLM_LZE, HISTO_maximum, GLZLM_LZLGE and HISTO_minimum (< 0.20).

Figure 2
figure 2

Intra-class correlation coefficients (ICC) of the radiomics features (RFs) depending on the intensity harmonization technique (IHT). Results are given with 95% confidence interval.

Unsupervised analysis

All 6 unsupervised classifications achieved were different. Table 3 shows the correlation matrices for Kappa indices and Baker coefficients. The pair of clustering with the highest positive correlation was obtained with IHTHM.All versus IHTHM.All.C (Kappa = 0.75, Baker coefficient = 0.55). The lowest correlated pair was obtained with No-IHT versus IHTHM.1 (Kappa = 0.18, Baker coefficient = 0.05). Both correlated dendrograms are displayed in Fig. 3.

Table 3 Comparisons of the different dendrograms obtained by hierarchical clustering of the radiomics features with the 6 datasets depending on the intensity harmonization technique (IHT). (a) Corresponds to the Cohen’s Kappa index ranging from 0 (completely different clustering assignements) to 1 (exactly the same clustering assignements). (b) Corresponds to the the Baker’s gamma coefficient ranging from 0 (completely different dendrograms) to 1 (exactly the same two dendrograms).
Figure 3
figure 3

Comparisons of the hierarchical clustering results based on radiomics features from different datasets depending on the intensity harmonization technique (IHT) with: (a) the highest divergence, and (b) the lowest divergence. The dendrograms were obtained according to the following IHTs: histogram matching (HM) with a randomly-chosen normalized histogram of a patient (IHTHM1) versus no use of harmonization technique (No-IHT); and HM with the average normalized histogram of the study population (IHTHM.All) versus IHTHM.All combined with ComBat harmonization method (IHTHM.All.C). By convention, cluster-1 (in blue) corresponds to the group of patients with the best prognosis regarding metastatic-relapse free survival.

Regarding the prognostic value of the clusters, our univariate analysis showed that significantly different survivals were found with the clusters obtained with the IHTHM.All radiomics dataset (Log-rang p-value = 0.03) but not with the other IHTs. Kaplan Meier curves for the 6 clustering analyses are given in Fig. 4.

Figure 4
figure 4

Kaplan–Meier curves for metastatic-relapse free survival depending on unsupervised clustering results based on radiomics features obtained with the different intensity harmonization techniques (IHT) or no use of harmonization technique (No-IHT).

To assess the prognostic values in presence of confounding variables, we elaborated multivariate models demonstrating that the clusters obtained with RFs from the No-IHT, IHTstd, IHTHM.All and IHTHM.All.C were independently associated with MFS in the multivariate modeling (p = 0.02, 0.007, 0.004 and 0.02, respectively—Table 4) but not the clusters obtained with RFs from the IHTfat and IHTHM.1. Concordance-indices of the 6 prognostic models ranged from 0.71 (95% CI 0.67–0.75) for IHTHM.1 to 0.75 (95% CI 0.70–0.79) for No-IHT, IHTHMstd and IHTHM.All. The concordance-index of a reference prognostic model taking into account the clinical and pathological confounding co-variables alone was of 0.71 (95% CI 0.67–0.75).

Table 4 Unsupervised analysis based on radiomics features (RFs)—Prognostic value of the clustering results depending on the intensity harmonization technique (IHT).

Supervised analysis

In total, there were 29/70 (41.4%) metastatic relapses within the first two years of follow-up, which were distributed into 21/50 (42%) events in the training cohort and 8/20 (40%) events in the validation cohort.

The final hyperparameters and performances of the classification models are given in Table 5. The best performances in repeated cross-validation were found with the models based on the RFs from the IHTHM.All and IHTHM.1 datasets (AUROC = 0.71, 95% CI 0.66–0.76, and 0.69, 95% CI 0.64–0.74, respectively). The lowest AUROC was obtained with the No-IHT dataset (0.57, 95% CI 0.52–0.63).

Table 5 Accuracy and area under the ROC curves (AUROC) of the supervised models in repeated cross validation (training cohort) and in the testing/validation independent cohort, depending on the 5 intensity harmonization techniques (IHTs) or the lack of IHT (named No-IHT).

In descending orders, the AUROCs on the testing cohort were 0.82 (95% CI 0.59–1) with IHTHM.1, 0.80 (95% CI 0.56–1) with IHTfat, 0.77 (95% CI 0.52–1) with IHTHM.All, 0.76 (95% CI 0.50–01) with No-IHT, 0.71 (95% CI 0.444–0.973) with IHTHM.All.C, and 0.69 (95% CI 0.41–0.56) with IHTstd. AUROCs of the most and less performant models and the No-IHT model in the testing cohort are shown in Fig. 5. The number of radiomics features included in the final models ranged from 3 (with No-IHT and IHTHM.AllC) to 21 (with IHTfat). Regarding the best final model, namely IHTHM.1, the number of selected radiomics features was of 7 out of 48 possible (by including the 3 shape features). Among these features, HISTO_Quartile1 and GLZLM_SZLGE were the most frequently selected (in 5 out of 6 models, and 4 out of 6 models, respectively) (Supplementary Data 5).

Figure 5
figure 5

ROC curves for the best and worse supervised models to predict metastatic relapse within 2 years after the end of initial treatment in the testing cohort (built on the radiomics features from the IHTHM.1 and IHTstd datasets, respectively). The ROC curve of the final model without using harmonization technique (No-IHT) is also shown for comparison.

Discussion

The post-processing of medical images to perform radiomics studies is mandatory to ensure the comparability of multicentric datasets but it can result in additional bias that may alter the performances of predictive models and preclude the reproducibility of MRI-based radiomics signatures. Because structural MRIs are acquired in arbitrary units, the intensity harmonization is crucial to enable the comparability of examinations acquired with different MR-systems, coils, and acquisition parameters. We found that all 45 textural features widely used in the literature were significantly influenced by IHT. Furthermore, depending on the IHC used, the results of unsupervised and supervised analyses based on RFs and their clinical correlations were dramatically changed. In addition, using an inappropriate IHT could decrease the performances of radiomics-based predictive models as it was highlighted by the comparative analysis with the models built with the No-IHT imaging dataset.

Our results concur with previous studies that found a significant influence of other post-processing steps on the absolute values of RFs (such as voxel size standardization, gray-levels discretization or manual segmentation) in addition to pre-processing steps (such as magnetic field strength, manufacturers, coils, acquisition parameters or filters). Recently, Scalco et al. found that the IHT for T2-WI had a significant impact on the reproducibility of RFs and on the inter-observer reproducibility of RFs that were extracted from pelvic organs from two MRIs separated by months29. These findings have been also applied to other IHTs such as variants of HM and a home-made method taking into account the SIs of organs of interest, the prostate, but the authors focused on the image, histogram and RFs values and not on RF-base predictions30. To our knowledge, this study is the first to demonstrate the dramatic impact of IHTs on RF-based predictions.

Moreover, in a recent review of MRI-based sarcoma radiomics studies, we found that 17 out 31 (54.8%) did not mention the method used for making comparable the SIs of MRI dataset (under review). It should be emphasized that the current Image Biomarker Standarisation Initiative and Radiomics Quality Score lack of precise guidelines regarding IHT for MRI31.

Previous studies have already emphasized the influence of IHT on segmentation and tissue classification tasks but they mostly involved brain MRI for inflammatory or degenerative diseases, and not specifically study their influence on radiomics analyses24,25,32,33. Moreover, the methods proposed in these studies were not readily transposable to non-brain imaging and/or not available in open source language (for instance, DeepHarmony)34.

In this study, we focused the analyses on techniques previously used in the body-imaging radiomics literature (i.e. scaling, histogram-matching or ComBat-Harmonization) but further studies should consider translating other popular intensity harmonization algorithms to body MRI. The RAVEL algorithm, which aims at estimating a voxel-specific unwanted variation by using a control region (i.e. brain cerebro-spinal fluid), may be particularly promising if applied to body-MR, with the possible use of healthy adipose tissues as control in the setting of soft tissue sarcomas for example24,25. Alternatively, instead of a post-processing intensity harmonization, the harmonization of SIs could be achieved since the acquisition step, through the use of standardized T1-mapping or T2-mapping sequence. However, thousands of MRIs have already been stored and, logically, the radiological community expects to pool and include these images in retrospective radiomics studies.

None of the IHTs used in this study demonstrated an unequivocal superiority compared to the others. This observation lets us hypothesize that the “best” technique is not universal but may actually vary depending on the dataset and the study objectives. Our present data does not allow us to validate this hypothesis, as it would require additional datasets to test if the same IHT constantly provides the best models whatever the disease and the outcome. While the unsupervised analysis highlighted the prognostic value of clusters elaborated with RFs from the IHTstd, IHTHM.All and IHTHM.All.C datasets, the supervised analysis emphasized on the other hand the prognostic value of other models elaborated with RFs from the IHTfat and IHTHM.1 in the testing cohort. It is worth noting that our supervised models showed moderately higher performances in the validation cohort than in the training cohort (range of differences: 0.03–0.13). Although this finding suggests that the models were not overfitted, it also indicates that the training could have been premature (despite the use of repeated cross-validation and exhaustive grid search) and that a sampling bias could have occurred during the data partitioning in our rather small study population (despite the fact that the splits were obtained randomly and were well-balanced regarding the outcome).

Importantly, our unsupervised analysis revealed that using an inappropriate IHT could even lead to a total loss of relevant information from the radiomics data. Indeed, the concordance indices of the reference model (which was elaborated with clinical and radiological variables alone) and the model relying on IHTHM.1 were equivalent, which stresses the lack of prognostic value of the corresponding clusters. Similarly, although the lowest AUROC was reached with the No-IHT dataset in cross-validation, the performances of this supervised model were not markedly different from those obtained with some of the IHTs in the two cohorts (especially the IHTstd). These findings also suggest that radiomics studies should investigate all the available IHTs in an exploratory subset of the cohort, as well as no use of IHT, and subsequently select the one that optimizes the predictions. For instance, the extraction of RFs according to various voxel sizes and/or numbers of gray levels is commonly performed in radiomics studies. By analogy, one could consider extracting the RFs according to different IHTs and select the most robust and predictive RFs at univariable level. Hence, the intensity harmonization techniques could be considered as a “hyperparameter” of the post-processing pipeline. Interestingly, IHTHM.All.C yielded moderately good performances in both unsupervised and supervised analyses (with similar results in training and testing cohorts), which suggests that this method may provide the more realistic radiomics data in the setting of our study. It should be emphasized that the co-variable arguments given to the ComBat function may/might be incomplete in the setting of sarcomas. In any case, the clinical outcome of the study should not be included among the ComBat covariables because it should not depend on the MR-system or acquisition parameters of the sequences. A distinctive feature of sarcomas over other cancers is their anatomical ubiquity, hence, requiring adjusting several other acquisition parameters depending on the tumor location (for instance thoracic wall, thigh or wrist). Further studies should investigate the best co-variables for ComBat for non-brain MRI. In addition, ComBat could have been used with the No-IHT, IHTfat, IHTstd, IHTHM.1 radiomics features. We purposely decided to limit the application of ComBat to only one dataset (IHTHM.All) to avoid multiplying the post-hoc analyses, performances measurements, or superposing ROC curves, while our current results already enables us to stress the strong impact of IHT on radiomics-features and radiomics-based classifications and predicitions.

Our results also deepened that intra-tumoral heterogeneous SIs on T2-WI is predictive of MFS in a quantitative manner and other studies have also correlated this parameter with overall and/or metastatic-relapse free survivals in STS patients with relatively close and similar performances to ours6,7,20. Indeed, Peeken et al. used an equivalent of IHTstd and applied ComBat to correct for multicenter effect. They also provided the sarcoma histological type as a biological covariable (which slightly improved the performances)6. Their best model relied on radiomics features from Fat Sat T2 weighted imaging and showed a concordance-index of 0.74 in the validation cohort. On the other hand, Spraker et al. did not explicitly use an intensity harmonization technique, neither ComBat7. Interestingly, their best clinical and radiological prognostic models for the overall survival showed a concordance-index of 0.78 in the validation cohort.

Our study has limits. First, the study population was relatively small although this is the largest study investigating IHT and radiomics. It should be noted sarcoma radiological studies rarely exceed our population number. Second, we focused this proof-of-concept methodological study on T2-WI sequences but further investigations should be performed on other MRI sequences, such as T1-WI, contrast-enhanced T1-WI, DCE-MRI and diffusion imaging. We purposely chose this sequence because it is commonly reported as the most informative morphological sequence for sarcomas8,20. Third, our study design could be criticized. Indeed, judging which of the IHTs is the best by using the performances of predictive models (AUROC or concordance-index) as judgment criteria can only be valid if the intrinsic prognostic value of MRI-based radiomics features is certain. In this case, lowering these performances with a particular IHT would mean that this IHT caused noise and inappropriate deviation in the data. However, as already stated, prior studies converged towards same results regarding the relationship between MRI-based radiomics features, heterogeneity on T2-WI and outcomes of sarcoma patients6,7,20,35. Alternative study designs could have been proposed in the absence of such relationship, (i) either by using a phantom made of compartments with various degrees of heterogeneity, (ii) or by using MRIs of healthy volunteers covering organs with different textures and investigating which IHT enables the best radiomics-based classification of these organs (by analogy with the study by Orlhac et al.)19. Fourth, other shape and textural RFs than the 48 features used in this study can be encountered in the literature. Yet, we purposely decided to limit our investigations to this set of RFs, which are proposed by the LIFEx freeware, as they follow the definitions of the Imaging Biomarker Standardization Initiative23,31. Furthermore, adding more potential radiomics predictors in our multivariate analyses would have increased the multidimensionality of our dataset and the risk of overfitted results regarding the limited number of patients.

To conclude, through the example of sarcomas, our study highlights that the IHT can directly influence the values of MRI-based RFs, subsequently leading to dramatical changes in the predictions of both unsupervised and supervised models. Therefore, IHTs need to be deepened regarding non-brain MRI and should be carefully explored and detailed when building radiomics models to ensure the robustness and reproducibility of radiomics signatures.