Introduction

Atypical lipomatous tumor (ALT) and lipoma are the most common soft-tissue lesions [1]. According to the 2020 edition of the World Health Organization classification [2], the term ALT is reserved for low-grade adipocytic neoplasms arising at anatomical sites for which surgery is generally curative, including the extremities and trunk [2]. ALTs have a relatively indolent disease course compared to well-differentiated liposarcomas, namely lipomatous lesions with the same histology but located in deep anatomical sites such as the retroperitoneum, mediastinum, and spermatic cord, where there is a higher risk for recurrence and dedifferentiation related to lower chances of achieving negative surgical margins [2]. In line with this relatively indolent clinical behavior, treatment strategy has progressively shifted from extensive surgery to marginal excision in ALTs, which is now considered an appropriate option to achieve local control while taking into account the morbidity rates associated with surgery [3]. On the other hand, lipomas are benign lipomatous lesions, which do not require any treatment unless symptomatic or due to cosmetic concerns [3]. Lipomas are rare in deep locations, such as the retroperitoneum, but very common in the extremities and trunk [1]. Thus, an accurate distinction between ALT and lipoma is desirable to offer optimal patient care.

In the diagnostic pathway of lipomatous soft-tissue lesions, magnetic resonance imaging (MRI) is the imaging method of choice for diagnosis and for differentiating ALT from lipoma, with high sensitivity and substantial specificity [4,5,6]. In detail, according to a recent meta-analysis, the sensitivity and specificity of radiologists evaluating multiple combined imaging parameters (called “radiologist gestalt”) range from 76 to 100% and 37 to 77%, respectively, if only studies focusing on lipoma and ALT are considered [4]. Nonetheless, a certain degree of interobserver variability has emerged even among expert readers [5,6,7], with kappa values ranging from 0.23 to 0.7 according to this meta-analysis [4]. Preliminary imaging studies applying radiomics have shown promise for improving diagnostic accuracy and characterizing lipomatous soft-tissue lesions more objectively [8]. Radiomics includes the extraction and analysis of quantitative parameters from medical images, known as radiomic features [9,10,11]. A crucial step of radiomic workflows is feature reproducibility assessment, as these quantitative parameters may suffer from interobserver variability, particularly regarding tumor delineation while performing manual segmentation [12,13,14,15]. Segmentation margins are also critical because the peritumoral area may influence the reproducibility of radiomic features and their diagnostic performance [15, 16]. Furthermore, in radiomic workflows, the effects of different image intensity discretization methods on feature reproducibility are debated [17,18,19]. In the literature, the intraclass correlation coefficient (ICC) is commonly employed to evaluate radiomic feature reproducibility [16, 20,21,22,23].

The aim of this study is to investigate the influence of interobserver manual segmentation variability on the reproducibility of MRI-based radiomic features in lipoma and ALT, also considering the impact of different image intensity discretization methods.

Materials and Methods

Design and Population

The Institutional Review Board approved this retrospective study and waived the need for informed consent. This study was designed to meet the numerical requirements of a reproducibility analysis in terms of patients and readers involved, namely 30 lesions and 3 different readers, according to the ICC guidelines by Koo and Li [24]. An electronic search of the pathology information system was performed, and 30 patients with lipomatous soft-tissue tumors were included (median age 58 [range 40–79] years). Inclusion criteria were as follows: (i) lipoma or ALT proven by post-surgical pathology, which was based on microscopic findings and MDM2 immunohistochemistry or fluorescence in situ hybridization; (ii) 1.5-T MRI performed within 3 months before surgery, including turbo spin echo T1-weighted and T2-weighted sequences without fat suppression. Exclusion criteria were ALT local recurrence and poor image quality or image artifacts affecting segmentation and radiomic analysis.

Details regarding location, size, and main imaging characteristics of the included lipomas and ALTs are provided in Table 1. All examinations were performed on one of two 1.5-T MRI systems (Magnetom Avanto or Magnetom Espree, Siemens Healthineers, Erlangen, Germany). Axial T1-weighted and T2-weighted MRI sequences were extracted for image analysis. The median matrix size and slice thickness were 512 × 512 (range 320–512 × 216–512) and 3.5 (range 3–5) mm, respectively. The median TE and TR were 11 (range 10–21) and 663 (range 454–800) ms on T1-weighted sequences, respectively. The median TE and TR were 100 (range 80–146) and 3664 (range 2000–7444) ms on T2-weighted sequences, respectively. All extracted DICOM images were converted to the NIfTI format prior to the analysis.

Table 1 Location, size and main imaging characteristics of the ALTs and lipomas included in this study

Image Segmentation

A musculoskeletal radiologist with 4 years of experience in musculoskeletal tumor imaging (S.G.), a general radiologist (V.G.), and a medical resident (J.B.) independently performed manual image segmentation using the open-source software ITK-SNAP (v3.8) [25]. The readers knew the study would deal with lipomatous soft-tissue tumors, but they were blinded to any additional information regarding pathology or disease course. Manual contour-focused segmentation was performed by drawing a region of interest (ROI) slice by slice to include the whole tumor volume on both axial T1-weighted and T2-weighted MRI sequences. Thereafter, margin shrinkage segmentation was computed by applying a marginal erosion to evaluate the influence of segmentation margins on feature reproducibility (Fig. 1). In detail, ROI shrinkage was performed using the fslmaths erosion function of the FMRIB Software Library [26]. The default kernels, namely a 3 × 3 × 3 box centered at the target voxel, were employed.
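The margin shrinkage step can be illustrated with a minimal NumPy sketch of binary erosion with a 3 × 3 × 3 box kernel. This is not the fslmaths implementation: the function name `erode_box3`, the toy mask, and the choice to treat voxels outside the image as background are assumptions made for illustration only.

```python
import numpy as np


def erode_box3(mask):
    """Binary erosion with a 3 x 3 x 3 box kernel (illustrative sketch).

    A voxel is kept only if every voxel in the box centered on it is
    non-zero. Voxels outside the image are treated as background, which
    is an assumption of this sketch.
    """
    m = np.asarray(mask) != 0
    padded = np.pad(m, 1, mode="constant", constant_values=False)
    nx, ny, nz = m.shape
    out = np.ones_like(m)
    # Intersect the 27 shifted copies of the mask covered by the box kernel.
    for dx in range(3):
        for dy in range(3):
            for dz in range(3):
                out &= padded[dx:dx + nx, dy:dy + ny, dz:dz + nz]
    return out


# Toy ROI: a 5 x 5 x 5 cube inside a 7 x 7 x 7 volume (hypothetical data).
toy_mask = np.zeros((7, 7, 7), dtype=np.uint8)
toy_mask[1:6, 1:6, 1:6] = 1
eroded = erode_box3(toy_mask)
print(int(toy_mask.sum()), int(eroded.sum()))  # 125 27
```

In this toy case the erosion peels one voxel layer off every face of the cube, which is the intended effect of shrinking the ROI away from the lesion margin.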

Fig. 1

The upper and lower rows present two different examples of lesion annotation. These include the original images (a, f) with corresponding contour-focused segmentation presented as a mask (b, g) and relative 3D volume (c, h). Finally, the results of automated margin shrinkage are shown for both the mask (d, i) and volume (e, j)

Radiomic Analysis

Image pre-processing and feature extraction were performed using PyRadiomics (v3.0.1) [27], an open-source Python software. Image pre-processing consisted of resampling to a 2 × 2 × 2 mm isotropic voxel size, intensity normalization (mean value of 300 and standard deviation of 100), and discretization with both a fixed bin number of 64 and a fixed bin width of 7, as implemented in PyRadiomics.
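The difference between the two discretization options can be sketched as follows. This is a simplified illustration rather than the PyRadiomics source code: the function names and the five toy intensity values are hypothetical, and the formulas follow the commonly documented definitions (fixed bin width: floor(x / W) − floor(min / W) + 1; fixed bin number: the ROI intensity range is split into N equal bins).

```python
import numpy as np


def discretize_fixed_bin_width(values, bin_width):
    """Bin index = floor(x / W) - floor(min(x) / W) + 1 (bins start at 1)."""
    values = np.asarray(values, dtype=float)
    shifted = np.floor(values / bin_width) - np.floor(values.min() / bin_width)
    return shifted.astype(int) + 1


def discretize_fixed_bin_number(values, n_bins):
    """Split the ROI intensity range [min, max] into n_bins equal bins."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:
        # Degenerate ROI with a single intensity: everything falls in bin 1.
        return np.ones(values.shape, dtype=int)
    bins = np.floor(n_bins * (values - lo) / (hi - lo)).astype(int) + 1
    # The maximum intensity would land in bin n_bins + 1; clip it to n_bins.
    return np.minimum(bins, n_bins)


# Hypothetical normalized ROI intensities (around mean 300, SD 100).
roi = [100, 230, 300, 370, 500]
fbw = discretize_fixed_bin_width(roi, bin_width=7)
fbn = discretize_fixed_bin_number(roi, n_bins=64)
print(fbw.tolist())  # [1, 19, 29, 39, 58]
print(fbn.tolist())  # [1, 21, 33, 44, 64]
```

With a fixed bin width, the number of gray levels depends on the intensity range of each ROI (58 in this toy case), whereas with a fixed bin number the bin width adapts to each ROI so that the full range always spans 64 levels.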

Original images were used for extraction of first-order, shape-based and texture features, which were grouped according to PyRadiomics official documentation (https://pyradiomics.readthedocs.io/en/latest/features.html) and included: 18 first-order features, 14 shape-based features, 22 gray-level cooccurrence matrix (GLCM) features, 16 gray-level size zone matrix (GLSZM) features, 16 gray-level run length matrix (GLRLM) features, 14 gray-level dependence matrix (GLDM) features, and 5 neighboring gray tone difference matrix (NGTDM) features.

In addition to the original images, Laplacian of Gaussian (LoG)–filtered (sigma = 2, 4, 6) and wavelet-transformed images (all possible low and high pass filter combinations) were obtained for extraction of first-order and texture features. Shape-based features are independent from gray-level value distribution and therefore were only computed on the original images. A total of 1106 features were extracted from original, LoG-filtered, and wavelet-transformed images for each MRI sequence.
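The total of 1106 features per sequence follows from the class counts listed above. The short check below reproduces the arithmetic, assuming that the three LoG sigma values and the eight 3D wavelet decompositions (all low/high-pass combinations along three axes, 2^3 = 8) each yield one derived image; variable names are illustrative.

```python
# Feature counts per class on the original image, as listed in the text.
features_per_class = {
    "first-order": 18,
    "shape": 14,
    "GLCM": 22,
    "GLSZM": 16,
    "GLRLM": 16,
    "GLDM": 14,
    "NGTDM": 5,
}

n_original = sum(features_per_class.values())        # all classes
n_non_shape = n_original - features_per_class["shape"]  # first-order + texture

n_log_images = 3           # sigma = 2, 4, 6
n_wavelet_images = 2 ** 3  # low/high-pass filter along each of three axes

# Shape features are computed once; the others once per derived image too.
total = n_original + n_non_shape * (n_log_images + n_wavelet_images)
print(n_original, n_non_shape, total)  # 105 91 1106
```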

Statistical Analysis

Interobserver reliability was assessed using the 95% confidence interval (CI) lower bound of the two-way, random-effects, single-rater agreement ICC. Features were considered stable when achieving good (0.75 ≤ ICC 95% CI lower bound < 0.9) to excellent (ICC 95% CI lower bound ≥ 0.9) interobserver reliability [24]. Differences among stable feature rates were evaluated using the chi-square test. Differences among ICC 95% CI lower bound values were evaluated using the Friedman test for repeated samples and the Wilcoxon signed rank test with continuity correction for pairwise comparisons. A two-sided p-value < 0.05 indicated statistical significance [28]. Data analysis was performed using the pandas and NumPy Python packages and the “irr” R package [29, 30].
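As a rough illustration of the reliability metric, the sketch below computes the point estimate of the two-way, random-effects, single-rater, absolute-agreement ICC from the usual ANOVA mean squares and applies the stability thresholds described above. It does not compute the 95% CI lower bound, which requires F-distribution quantiles and was obtained with the “irr” R package in this study. The function names are hypothetical, and the rating matrix is the classic six-subject, four-rater textbook example from Shrout and Fleiss, not data from this study.

```python
import numpy as np


def icc2_1(ratings):
    """Point estimate of ICC(2,1): two-way random effects, single rater,
    absolute agreement. Rows are subjects (lesions), columns are raters."""
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()  # between subjects
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()  # between raters
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def stability_label(ci_lower_bound):
    """Thresholds applied to the ICC 95% CI lower bound, as in the text."""
    if ci_lower_bound >= 0.9:
        return "excellent"
    if ci_lower_bound >= 0.75:
        return "good"
    return "not stable"


# Textbook example data (6 subjects x 4 raters), used only for checking.
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc2_1(example), 2))  # 0.29
print(stability_label(0.93), stability_label(0.80), stability_label(0.60))
```

Using the CI lower bound rather than the point estimate is the more conservative choice, since a feature is labeled stable only if even the pessimistic end of the interval clears the 0.75 threshold.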

Results

Stable Feature Rates by Intensity Discretization Method and Segmentation Approach

After implementing image intensity discretization with fixed bin number, in contour-focused vs. margin shrinkage segmentation, the stable feature rates were 95.21% (n = 1053) vs. 95.66% (n = 1058) and 92.68% (n = 1025) vs. 90.69% (n = 1003) for T1-weighted and T2-weighted images, respectively, with no statistical difference (p = 0.298). In Fig. 2, box and whisker plots show the interobserver reproducibility of feature classes derived from contour-focused and margin shrinkage segmentations, grouped according to image type and MRI sequence. The matching stable features derived from contour-focused and margin shrinkage segmentations performed on T1-weighted and T2-weighted images were 92.68% (n = 1025) and 86.80% (n = 960), respectively, as detailed in Supplementary Files 1–2.

Fig. 2

Contour-focused (original ROI) vs. margin shrinkage (eroded ROI) segmentation after image intensity discretization with fixed bin number. Box and whisker plots show the interobserver reproducibility of feature classes grouped according to image type and MRI sequence. GLCM, gray-level cooccurrence matrix; GLDM, gray-level dependence matrix; GLRLM, gray-level run length matrix; GLSZM, gray-level size zone matrix; ICC, intraclass correlation coefficient; LoG, Laplacian of Gaussian; NGTDM, neighboring gray tone difference matrix

After implementing image intensity discretization with fixed bin width, in contour-focused vs. margin shrinkage segmentation, the stable feature rates were 97.65% (n = 1080) vs. 95.39% (n = 1055) and 95.75% (n = 1059) vs. 96.47% (n = 1067) for T1-weighted and T2-weighted images, respectively, with no statistical difference (p = 0.175). In Fig. 3, box and whisker plots show the interobserver reproducibility of feature classes derived from contour-focused and margin shrinkage segmentations, grouped according to image type and MRI sequence. The matching stable features derived from contour-focused and margin shrinkage segmentations performed on T1- and T2-weighted images were 94.30% (n = 1043) and 93.76% (n = 1037), respectively, as detailed in Supplementary Files 3–4.

Fig. 3

Contour-focused (original ROI) vs. margin shrinkage (eroded ROI) segmentation after image intensity discretization with fixed bin width. Box and whisker plots show the interobserver reproducibility of feature classes grouped according to image type and MRI sequence. GLCM, gray-level cooccurrence matrix; GLDM, gray-level dependence matrix; GLRLM, gray-level run length matrix; GLSZM, gray-level size zone matrix; ICC, intraclass correlation coefficient; LoG, Laplacian of Gaussian; NGTDM, neighboring gray tone difference matrix

In image intensity discretization with fixed bin number vs. fixed bin width, the latter discretization method yielded higher rates of stable features regardless of the segmentation approach (p < 0.001). Tables 2, 3, 4 and 5 show the number and percentage of stable features that were obtained with different combinations of discretization methods and segmentation approaches, grouped according to feature class and image type.
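The stable feature rates reported in this section are simply the stable feature counts divided by the 1106 features extracted per sequence. The following check reproduces the percentages from the counts given in the text; the dictionary layout and names are illustrative only.

```python
TOTAL_FEATURES = 1106  # features extracted per MRI sequence

# Stable feature counts as reported in the text:
# (discretization, sequence, segmentation) -> n
stable_counts = {
    ("fixed bin number", "T1", "contour-focused"): 1053,
    ("fixed bin number", "T1", "margin shrinkage"): 1058,
    ("fixed bin number", "T2", "contour-focused"): 1025,
    ("fixed bin number", "T2", "margin shrinkage"): 1003,
    ("fixed bin width", "T1", "contour-focused"): 1080,
    ("fixed bin width", "T1", "margin shrinkage"): 1055,
    ("fixed bin width", "T2", "contour-focused"): 1059,
    ("fixed bin width", "T2", "margin shrinkage"): 1067,
}

# Percentage of stable features, rounded to two decimals as in the text.
rates = {
    key: round(100 * n / TOTAL_FEATURES, 2)
    for key, n in stable_counts.items()
}

for (method, sequence, segmentation), rate in rates.items():
    print(f"{method:16s} {sequence} {segmentation:16s} {rate:.2f}%")
```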

Table 2 Discretization with fixed bin number and contour-focused segmentation. Number and percentage of stable features with good (0.75 ≤ ICC 95% CI lower bound < 0.9) and excellent (ICC 95% CI lower bound ≥ 0.9) interobserver reproducibility grouped according to feature class and image type
Table 3 Discretization with fixed bin number and margin shrinkage segmentation. Number and percentage of stable features with good (0.75 ≤ ICC 95% CI lower bound < 0.9) and excellent (ICC 95% CI lower bound ≥ 0.9) interobserver reproducibility grouped according to feature class and image type
Table 4 Discretization with fixed bin width and contour-focused segmentation. Number and percentage of stable features with good (0.75 ≤ ICC 95% CI lower bound < 0.9) and excellent (ICC 95% CI lower bound ≥ 0.9) interobserver reproducibility grouped according to feature class and image type
Table 5 Discretization with fixed bin width and margin shrinkage segmentation. Number and percentage of stable features with good (0.75 ≤ ICC 95% CI lower bound < 0.9) and excellent (ICC 95% CI lower bound ≥ 0.9) interobserver reproducibility grouped according to feature class and image type

Feature ICC Values by Intensity Discretization Method and Segmentation Approach

The median and interquartile (first to third) range ICC 95% CI lower bound values of radiomic features extracted from both T1-weighted and T2-weighted sequences are reported in Table 6, grouped according to image intensity discretization method and segmentation approach. A significant difference among ICC values was found using the Friedman test for repeated samples on both T1-weighted and T2-weighted sequences (p < 0.001). In pairwise comparisons, higher feature ICC 95% CI lower bound values were found when performing image intensity discretization with fixed bin width compared to fixed bin number, regardless of the segmentation approach, on both T1-weighted and T2-weighted images (p < 0.001). On T1-weighted images, no difference in terms of ICC 95% CI lower bound was found between contour-focused and margin shrinkage segmentations after both discretization methods with fixed bin number (p = 0.8) and width (p = 0.62). On T2-weighted images, no difference in terms of ICC 95% CI lower bound was found between the two segmentation approaches after discretization with fixed bin number (p = 0.24). On T2-weighted images, higher ICC 95% CI lower bound values were found when performing margin shrinkage segmentation after intensity discretization with fixed bin width, compared to contour-focused segmentation (p < 0.001). In Fig. 4, box and whisker plots show the interobserver reproducibility of all features extracted from each MRI sequence using different discretization methods and segmentation approaches.

Table 6 ICC values by discretization method and segmentation approach. Median and interquartile (first to third) range ICC 95% CI lower bound values of radiomic features extracted from both T1-weighted and T2-weighted sequences, grouped according to discretization method and segmentation approach
Fig. 4

Interobserver reproducibility by discretization method and segmentation approach. Box and whisker plots show the interobserver reproducibility of all features extracted using different discretization methods and ROIs without (contour-focused segmentation) or with marginal erosion (margin shrinkage segmentation), grouped according to MRI sequence. FBN, fixed bin number; FBW, fixed bin width

Discussion

The main finding of our study is that the rates of stable radiomic features extracted from T1-weighted and T2-weighted MRI sequences were very high (90% or higher) regardless of the discretization method and segmentation approach. The discretization method with fixed bin width yielded higher stable feature rates and higher feature ICC values compared to fixed bin number, regardless of the segmentation approach with or without marginal erosion (p < 0.001). Additionally, no difference in stable feature rates was found between the segmentation approaches, regardless of the discretization method (p ≥ 0.175). Overall, a small but still not negligible degree of segmentation variability highlighted the need to include a reliability analysis in radiomic studies.

Radiomics has great potential as a non-invasive biomarker to quantify several tumor characteristics, both standalone and combined with artificial intelligence methods such as machine learning [31,32,33]. However, it faces challenges to clinical implementation [34]. A great variability in radiomic features has emerged as a major issue across studies, and image segmentation is the most critical step [11]. As segmentation is time-consuming if performed manually, prior to conducting radiomic studies, methodological analyses would be desirable to preliminarily evaluate the robustness of different segmentation approaches and avoid biases due to non-reproducible, noisy features. Similar analyses were previously performed in kidney [16], lung and head and neck [14], and cartilaginous bone [15] lesions. Regarding lipomatous soft-tissue tumors, most radiomic studies included a feature reproducibility assessment as a dimensionality-reduction method in their radiomic workflow, which was built with the aim of differentiating benign from malignant (including low-grade) lesions [35,36,37,38,39,40,41,42]. More recently, Sudjai et al. compared the effects of intra- and interobserver segmentation variability on the reproducibility of 2D and 3D MRI-based radiomic features in lipoma and ALT [43]. A region growing-based semiautomatic contour-focused segmentation was performed on T1-weighted sequences by two readers and only original images were used for feature extraction, resulting in 43 out of 93 (46.2%) 2D features and 76 out of 107 (71%) 3D features with an absolute agreement ICC ≥ 0.75, which defined feature stability [43]. Based on their findings, we focused our study on 3D segmentations only, as they yielded higher stable feature rates. We compared two image intensity discretization methods (fixed bin number vs. fixed bin width) and two segmentation approaches (contour-focused vs. margin shrinkage) on both T1-weighted and T2-weighted sequences, involving three different readers as suggested by the ICC guidelines by Koo and Li [24]. After extraction of features from original, filtered and transformed images (1106 features per sequence compared to 107 in the previous study [43]), we found higher rates of stable features (90% or higher per sequence, regardless of the discretization method and segmentation approach) using ICC 95% CI lower bound ≥ 0.75 as a stricter cutoff to define feature stability. This difference could be attributed to the use of filtered and transformed (in addition to the original) images for feature extraction in our study, as well as to the different experiences of the readers involved in image segmentation, namely a statistician and a research scientist in the previous study [43] and three physicians in our study. Despite these differences, a common conclusion that can be drawn from the previous study [43] and ours is that most 3D MRI radiomic features of lipoma and ALT have good reproducibility, although a certain degree of segmentation variability exists.

In our study, T1-weighted and T2-weighted MRI sequences demonstrated good reproducibility regardless of the image intensity discretization method employed in image pre-processing, which was performed using both options of fixed bin number and fixed bin width, with stable feature rates respectively ranging from 90.69 to 95.66% and from 95.39 to 97.65%. The discretization method with fixed bin width resulted in higher stable feature rates and higher feature ICC values, thus providing more robust features compared to discretization with fixed bin number in our series. This finding is in line with previous positron emission tomography and MRI studies showing better feature reproducibility when implementing fixed bin width [44, 45]. Margin shrinkage led to an improvement in terms of feature ICC values compared to contour-focused segmentation only when implementing discretization with fixed bin width on T2-weighted images. Conversely, no difference in terms of feature ICC values was found between the two segmentation approaches when implementing discretization with fixed bin width on T1-weighted images or fixed bin number regardless of the employed MRI sequence. Additionally, no difference in terms of stable feature rates was found between the two segmentation approaches, regardless of the discretization method. Thus, a definite conclusion regarding the superiority of one segmentation approach over the other cannot be drawn. This confirms the need for a preliminary assessment of feature reproducibility in radiomic workflows and is in line with literature emphasizing the importance of reproducibility in artificial intelligence and radiology [46,47,48].

Some limitations of our study should be addressed. First, it has a retrospective design, although a prospective analysis is not strictly necessary for radiomic studies [49]. Second, the retrospective design accounts for the exclusion of contrast-enhanced MRI, which was not performed consistently in our series of lipomas and ALTs. This is in line with recent studies suggesting that the value of contrast administration may be limited in lipoma and ALT [6, 50], with no clear improvement in diagnostic accuracy following the addition of contrast-enhanced sequences to a non-contrast MRI protocol [50]. Finally, due to its scope, this was a single-institution study, and the generalizability of our results should be confirmed on more varied datasets.

Conclusions

Radiomic features of lipoma and ALT extracted from T1-weighted and T2-weighted MRI sequences are reproducible regardless of the segmentation approach and discretization method, although a minimal degree of segmentation variability exists and highlights the need to perform a preliminary reproducibility analysis in radiomic studies. As stable feature rates were similar between contour-focused and margin shrinkage segmentations, it could be reasonable to prefer the former approach for ease of use in clinical practice. Image intensity discretization with fixed bin width provided higher stable feature rates and feature ICC values compared to discretization with fixed bin number. Thus, fixed bin width discretization might be favored when performing image pre-processing in future radiomic studies dealing with lipomatous soft-tissue tumors.