Introduction

Lung carcinoid tumors represent 1 to 2% of primary pulmonary neoplasms in adults, that is, 1.6 per 100,000 people in the US in 2003 [1, 2]. These tumors arise from neuroendocrine cells, which are physiologically present throughout the lung [3]. The bronchopulmonary system is the second most frequent location of carcinoid tumors after the gastrointestinal tract [4]. Carcinoids have an endobronchial localization in 85% of cases [2]. Typically, the tumor is revealed by atelectasis and has the macroscopic appearance of a “strawberry” on bronchoscopy. Thanks to the wider use of chest CT, an increasing number of carcinoids are initially detected as incidental solid peripheral nodules [5]. In contrast, pulmonary hamartomas are benign tumors that are often asymptomatic and discovered incidentally. Their histopathological components may include fat, cartilage and epithelial tissue [6]. As the amount of fat and/or cartilage is variable [7], attenuation patterns on CT may vary, making it difficult to confirm the diagnosis non-invasively. Based on morphological CT features alone, hamartomas can thus mimic carcinoids, especially when there is neither macroscopic fat nor calcification. Conversely, carcinoids may also contain fat and calcifications in varying proportions [8]. Clinically, both tumors are either asymptomatic or present with cough or pneumonia when they cause atelectasis [8, 9]. In these cases, the radiological features of the two tumors overlap, which makes a confident imaging diagnosis difficult, and misdiagnosis is common. Differentiating carcinoids from hamartomas is clinically important because localized carcinoids require surgical resection, whereas hamartomas require no treatment. Surgery for these tumors varies from enucleation to pneumonectomy, with a morbidity rate reaching 15% for open thoracotomy [10].
There would be a clear clinical benefit in identifying hamartomas pre-operatively, to avoid unnecessary surgery, without misdiagnosing a carcinoid tumor.

Radiomics is a data-driven research field that uses high-throughput mining of quantitative features extracted from medical images to discover new imaging biomarkers and enable phenotypic profiling of lesions [11]. There is increasing interest in using radiomics to implement personalized medicine, especially in oncology [12] and in lung cancer [13]. As the histological patterns of hamartomas and carcinoids differ, we hypothesized that radiomics may help differentiate these two tumor types.

The aim of this study was therefore to determine whether radiomics features extracted from baseline contrast-enhanced chest CT images could distinguish carcinoids from atypical hamartomas.

Materials and methods

This retrospective study from two independent centers received approval from the Institutional Review Board (Comité d’Ethique pour la Recherche en Imagerie Médicale n°CRM-2202–229). One center was used as the training set to develop and cross-validate a model which was tested on an external validation set from the second institution (Fig. 1).

Fig. 1

Illustrated full process of the study, from data curation to analyses

Training set

Atypical hamartomas were defined as hamartomas that were misdiagnosed by the multidisciplinary tumor board and treated with surgery. Using a thoracic surgeons’ register, all patients with a histologically confirmed diagnosis of carcinoid or hamartoma after surgery between November 2009 and June 2020 at Hopital Nord (APHM, Marseille, France) were identified. The inclusion criteria were: 1) a pre-operative chest CT available in the Picture Archiving and Communication System of the institution, 2) presence of a post-contrast acquisition reconstructed with a soft tissue kernel, 3) slice thickness ≤ 2 mm. The exclusion criteria were: 1) two tumors in the same operated lobe, 2) tumor including fewer than 64 voxels, as suggested in previous studies [14], and 3) insufficient image quality for the measurements (motion or beam hardening artifacts).

Examinations came from different institutions and CT devices and therefore were performed with variable parameters (Supplementary Material). The median slice thickness was 1.25 mm [Q1–Q3, 1.25–1.50].

The age and gender of the patients, and the type of surgery performed were collected. Visual imaging signs on CT previously described for these tumors were analyzed: central or peripheral (i.e., sub-segmental bronchus or lower) localization; endobronchial position (if the tumor was located entirely or partially within bronchus); presence of atelectasis (partial or total); bronchial contact (distortion of a bronchus near the tumor); calcifications; border (lobulated or smooth). The longest diameter on the axial plane (mm) was also measured. These characteristics were recorded by a radiologist (L.C., 4 years of experience in chest imaging), blinded to histology.

Feature extraction

Two radiologists (L.C. and P.H.) independently segmented volumes-of-interest (VOI) of the lesions on the soft kernel images, using 3D Slicer (version 4.7, National Institutes of Health–funded; https://www.slicer.org) [15, 16]. Large vessels and bronchi were not included. Radiologists were blinded to histology. The choice of window width and window level within the software was left to each radiologist’s preference, so that the nodule could be delineated efficiently (examples in Fig. 2).

Fig. 2

(a) Axial, (b) coronal and (c) sagittal planes of a contrast-enhanced chest CT in mediastinal window setting showing the 3D segmentation of one tumor, in blue, from which the ‘median’ feature was extracted, and (d) the volume rendering reconstruction of the merged segmentations of the two radiologists (blue and green) to illustrate reproducibility. (e, f) Two examples of red circular 2D ROIs used for the measurement of mean attenuation. These circles were drawn using the dedicated tool of the viewer and placed over 90% of the lesion, on the slice in which the tumor was the largest

The Pyradiomics library (version 3.0.1, Computational Imaging & Bioinformatics Lab—Harvard Medical School) was used to extract radiomics features from the VOI using Python (version 3.8.8) [17]. Fixed bin width was set to 64 with no other preprocessing filter. One hundred five radiomics features were extracted, including shape-based (14 features), first-order statistics (18 features) and textural features (73 features). Their definitions are detailed in Additional file 1: Table e1 and follow the Image Biomarker Standardization Initiative guidelines [18].
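
To make the fixed-bin-width setting concrete, the following minimal sketch discretizes attenuation values with a bin width of 64, using the fixed-bin-width formula given in the PyRadiomics documentation (bin index = floor(x / W) − floor(min(x) / W) + 1). The function name and the sample HU values are illustrative assumptions and are not taken from the study's code.

```python
import math

def discretize_fixed_bin_width(values, bin_width=64):
    """Map intensities to 1-based bin indices using a fixed bin width."""
    if not values:
        return []
    # Offset so that the bin containing the minimum value gets index 1.
    offset = math.floor(min(values) / bin_width)
    return [math.floor(v / bin_width) - offset + 1 for v in values]

# Hypothetical HU values sampled inside a VOI (illustrative only).
hu_values = [-30, 5, 60, 70, 130]
bins = discretize_fixed_bin_width(hu_values, bin_width=64)
```

As far as the PyRadiomics documentation indicates, this discretization mainly concerns the texture features; first-order statistics such as the median are computed from the original intensities.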

Table 1 Population characteristics

Feature reduction, selection, and model building

Feature reduction was performed based on inter-observer reproducibility and feature redundancy. Inter-observer reproducibility was evaluated for all features, and features presenting with pairwise intraclass correlation coefficients (ICC, two-way random effect, single rater, absolute agreement) < 0.8 were considered not reproducible [19] and excluded. Reproducible features were then compared two-by-two using a Spearman correlation, and highly correlated features with a coefficient ≥ 0.9 were considered redundant and only one was retained.
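
The redundancy step can be sketched as follows. This is a minimal illustration, not the study's implementation: it assumes that the absolute value of the Spearman coefficient is compared with the 0.9 cut-off and that, within a correlated pair, the feature encountered first is the one retained (the text does not specify either choice). The function names and the toy feature values are hypothetical.

```python
import numpy as np

def average_ranks(x):
    """Return 1-based ranks of x, giving tied values their average rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    for value in np.unique(x):
        tied = x == value
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of ranks."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])

def drop_redundant(features, threshold=0.9):
    """Greedily keep a feature only if |rho| < threshold with all kept ones.

    `features` maps a feature name to its list of per-lesion values.
    Assumption: the first feature of a correlated pair is the one retained.
    """
    kept = []
    for name, values in features.items():
        if all(abs(spearman(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Toy values for five lesions (illustrative only, not study data).
toy_features = {
    "Median": [10, 40, 55, 70, 90],
    "Mean": [12, 38, 57, 69, 95],          # monotonic with Median -> redundant
    "Coarseness": [0.3, 0.1, 0.5, 0.2, 0.4],
}
kept = drop_redundant(toy_features, threshold=0.9)
```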

After feature reduction, a sequential forward feature selection method was applied with a fivefold cross-validation setup to find the best performing feature combination on the training set. The scoring criterion was the area under the receiver operating characteristic curve (AUC). The radiomics signature that obtained the highest AUC value was selected to retrain a Random Forest (RF) classifier with a tenfold cross-validation on the whole training set. The RF hyperparameters were fine-tuned using a grid search approach. The best operating point of the model was defined on the training set as the threshold maximizing the Youden index. AUC, sensitivity, and specificity were calculated. Their 95% CI were computed using bootstrapping with 1000 repetitions.
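
The evaluation metrics named above can be illustrated with a short, dependency-free sketch. It computes the AUC from pairwise comparisons, the threshold maximizing the Youden index (J = sensitivity + specificity − 1), and a percentile bootstrap confidence interval. The resampling scheme (cases drawn with replacement, percentile limits) is an assumption, since the text does not detail it, and the scores and labels are toy values rather than study data.

```python
import random

def auc(scores, labels):
    """AUC as the probability that a positive outranks a negative (ties = 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def youden_threshold(scores, labels):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    A case is called positive when its score is >= threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = (None, -1.0)
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
        j = tp / n_pos + tn / n_neg - 1.0
        if j > best[1]:
            best = (t, j)
    return best

def bootstrap_auc_ci(scores, labels, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the AUC (assumed resampling scheme)."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if 0 < sum(y) < n:  # both classes are needed to compute an AUC
            stats.append(auc([scores[i] for i in idx], y))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Toy classifier outputs (illustrative only).
toy_scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
toy_labels = [0, 0, 1, 0, 1, 1, 0, 1]
toy_auc = auc(toy_scores, toy_labels)
toy_threshold, toy_j = youden_threshold(toy_scores, toy_labels)
ci_low, ci_high = bootstrap_auc_ci(toy_scores, toy_labels)
```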

Analysis of performance of the most important feature

The radiomics signature features were analyzed according to their importance in the model. The most important feature was analyzed independently to determine whether it could predict histology on its own, using the senior radiologist’s segmentations. The threshold optimizing the Negative Likelihood Ratio for predicting hamartomas was determined. Two additional thresholds were chosen to optimize the specificity for each tumor type.

To approximate the 3D feature using widely available clinical tools, mean attenuation was also calculated from a circular 2D-ROI placed over 90% of the lesion, on the slice in which the nodule was the largest, avoiding calcifications if present. This measurement was performed by the two radiologists.
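
The difference between the two measurements can be sketched on a synthetic volume. The sketch below is an illustration under stated assumptions: the circular ROI is centred on the lesion centroid and sized so that its area equals 90% of the lesion area on the largest slice, which is only one possible reading of "placed over 90% of the lesion", and it does not avoid calcifications. Function names and the synthetic data are hypothetical.

```python
import numpy as np

def median_hu_3d(volume, mask):
    """Median HU over all voxels of the 3D segmentation (boolean mask)."""
    return float(np.median(volume[mask]))

def mean_hu_2d_circle(volume, mask, coverage=0.9):
    """Mean HU in a circular ROI on the slice with the largest lesion area.

    Assumption: the circle is centred on the lesion centroid and its area
    equals `coverage` times the lesion area on that slice.
    """
    areas = mask.sum(axis=(1, 2))          # lesion area on each axial slice
    z = int(np.argmax(areas))              # slice where the lesion is largest
    rows, cols = np.nonzero(mask[z])
    cy, cx = rows.mean(), cols.mean()
    radius = np.sqrt(coverage * areas[z] / np.pi)
    yy, xx = np.ogrid[:mask.shape[1], :mask.shape[2]]
    circle = (yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2
    return float(volume[z][circle].mean())

# Synthetic example: a 50 HU sphere in -800 HU "lung" (illustrative only).
gz, gy, gx = np.ogrid[:33, :33, :33]
lesion = (gz - 16) ** 2 + (gy - 16) ** 2 + (gx - 16) ** 2 <= 10 ** 2
ct = np.where(lesion, 50.0, -800.0)
median_3d = median_hu_3d(ct, lesion)
mean_2d = mean_hu_2d_circle(ct, lesion)
```

In this homogeneous synthetic lesion both measurements return 50 HU; in real tumors they can differ because the 2D mean samples a single slice and is sensitive to extreme values.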

Reproducibility of 3D and 2D measures

The inter-reader reproducibility of the measure of the most important feature in the model (3D and 2D measure) was assessed on the training set using the ICC (two-way random effect, single rater, absolute agreement) and Bland–Altman method (bias, standard deviation of the bias, limits of agreement (LoA)). The reproducibility of three non-contiguous 2D measures within each tumor was assessed by the same methods.
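
For reference, both reproducibility measures can be written in a few lines. The sketch below uses the standard mean-squares formula for the two-way random-effects, single-rater, absolute-agreement ICC and the usual Bland–Altman limits of agreement (bias ± 1.96 SD of the differences). It is an illustrative re-implementation in Python and not the R code used in the study; the reader values are toy data.

```python
import numpy as np

def icc_a1(ratings):
    """ICC, two-way random effects, single rater, absolute agreement.

    `ratings` has shape (n_subjects, k_raters).
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()   # between subjects
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()   # between raters
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return float((msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n))

def bland_altman(reader1, reader2):
    """Return bias, SD of the differences, and 95% limits of agreement."""
    d = np.asarray(reader1, dtype=float) - np.asarray(reader2, dtype=float)
    bias = float(d.mean())
    sd = float(d.std(ddof=1))
    return bias, sd, (bias - 1.96 * sd, bias + 1.96 * sd)

# Toy HU measurements by two readers on five lesions (illustrative only).
reader1 = [10, 40, 55, 70, 90]
reader2 = [12, 38, 57, 69, 95]
icc = icc_a1(np.column_stack([reader1, reader2]))
bias, sd, (loa_low, loa_high) = bland_altman(reader1, reader2)
```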

Correlation between contrast enhancement quality and the most important feature

To verify that contrast concentration did not influence the measurements made on carcinoids, a Spearman correlation was computed between the aforementioned feature and the mean attenuation value obtained with a 2D-ROI drawn in the pulmonary artery trunk.

Carcinoids were also split into two groups using a cut-off of 250 HU in the pulmonary artery trunk, measured with 2D-ROIs and previously validated as a quality criterion for chest-CT arterial enhancement [20]. The two groups were compared to ensure that contrast concentration did not influence the measurement.

External validation set

An external validation set from an independent center (Hopital Européen Georges Pompidou—APHP—Paris—France) was collected, identified from the pathology register, to test the model using the same inclusion and exclusion criteria as the training set. The lesions were delineated by one radiologist (P.H.), blinded to histology.

The radiomics signature, the three thresholds for the most important feature alone, and the 2D-derived measure were tested on this external validation dataset, and their performance was evaluated (AUC, sensitivity, specificity).

Corrected positive and negative predictive values (PPV and NPV) were calculated using the mean of the ten years’ prevalence observed in the two centers.

Statistical analysis

Continuous data were expressed as median [Q1–Q3]. Categorical data were expressed as frequency or percentage. A two-sided p-value < 0.05 was considered statistically significant. The Radiomics Quality Score was calculated [18]. Quantitative data are given with their 95% CI. To compare the semantic criteria of hamartomas and carcinoid tumors, a Mann–Whitney test was used for quantitative data and a Chi-square test for qualitative data.

The following Python packages were used: Numpy (version 1.20.1) and Pandas (version 1.2.4) for data handling; Mlxtend (version 0.19.0) and Scikit-Learn (version 0.24.1) for preprocessing, machine learning, and performance evaluation; and matplotlib (version 3.3.4) for plots. The ICC was computed in R (version 3.6.1) using the irr package (version 0.84.1).

Results

Training and external validation datasets

Two hundred and six patients who had available post-operative histopathological reports of carcinoids or hamartomas were reviewed for the training set. Among them, 82 patients met the inclusion criteria. The following patients were then excluded: two tumors in the same resected part of the lung (N = 1), small tumor size (N = 1), or insufficient image quality (N = 7). Finally, 73 patients with a median age of 58 [43–69] years, including 16 hamartomas and 57 carcinoids (42 typical and 15 atypical), were retrospectively analyzed (Fig. 3). The external validation dataset, following the same inclusion and exclusion criteria as the training set, included 54 patients (32 carcinoids, including 25 typical and 7 atypical, and 22 hamartomas).

Fig. 3

Flowchart of the study

There was no statistically significant difference in semantic characteristics for each tumor type between the training and external validation datasets, except for endobronchial protrusion in hamartomas, which was more frequent in the training set (Table 1). A comparison between all hamartomas and carcinoid tumors is provided in the supplemental material (Table 2).

Table 2 Comparison between hamartoma and carcinoid tumors

Radiomics criteria

The median number of voxels in 3D VOIs was 2296 [Q1‒Q3, 417‒7856]. The median number of pixels in 2D ROIs was 200 [62‒428]. Radiomics feature reduction according to ICC ≥ 0.8 resulted in 56 reproducible features. Among them, 32 were redundant (Spearman correlation coefficient ≥ 0.9), leaving 24 features for subsequent analyses. The sequential forward feature selection using the bagging classifier on the training set yielded a radiomics signature of five features that maximized the AUC value in distinguishing the two tumors (0.89 [95% CI: 0.81–0.98]).

These features were: first-order features (‘Median’ and ‘Maximum’ attenuation) and texture features (‘DifferenceVariance’, ‘SmallDependenceHighGrayLevelEmphasis’ and ‘Coarseness’). The Youden index was 0.61. When applied to the external validation set, the radiomics signature AUC, sensitivity and specificity were 0.76 [95% CI: 0.71–0.82], 91% [95% CI: 86–95%] and 46% [95% CI: 37–55%], respectively (Fig. 4). The importance of each feature in the model was 0.31, 0.26, 0.18, 0.15 and 0.10, respectively.

Fig. 4

ROC curve of the RF model on the external validation set and the corresponding confusion matrix. Examples of two axial slices of contrast-enhanced chest CT showing a hamartoma (a) with a 3D median attenuation of −15 HU and a 2D mean attenuation of −22 HU, and a carcinoid tumor (b) with a 3D median attenuation of 71 HU and a 2D mean attenuation of 77 HU

The Radiomics Quality Score was performed according to the standard for radiomics studies and was 47/100 (Additional file 1: Table e2).

Most important feature 3D and 2D-ROIs thresholds

The radiomics 3D ‘median’ attenuation feature, corresponding to the median HU value in the 3D VOI, reached a cross-validated AUC score of 0.85 [95% CI: 0.74–0.96] on the training set (Additional file 1: Figure e1). We selected the following intensity thresholds to predict hamartoma or carcinoids with high specificity on the training set: < 10 HU to predict hamartomas (specificity 96%, [95% CI: 96–99%]), > 60 HU to predict carcinoids (specificity 68%, [95% CI: 55–79%]). The threshold that maximized the Negative Likelihood Ratio (4.9) was 40 HU, with a sensitivity of 69% ([95% CI: 44–86%]) and a specificity of 86% ([95% CI: 75–93%]) to predict hamartomas on the training set for tumors with a 3D ‘median’ attenuation value below 40 HU.
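
The thresholds above lend themselves to a simple three-zone reading, sketched below together with the likelihood ratios derived from sensitivity and specificity. This is an illustration only: the function names and the "indeterminate" label for values between the two cut-offs are assumptions, not terms used in the study. With the sensitivity (69%) and specificity (86%) reported at 40 HU, the ratio sensitivity / (1 − specificity) evaluates to about 4.9 and (1 − sensitivity) / specificity to about 0.36.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) from sensitivity and specificity given as fractions."""
    lr_pos = sensitivity / (1.0 - specificity)
    lr_neg = (1.0 - sensitivity) / specificity
    return lr_pos, lr_neg

def classify_by_median_hu(median_hu, low=10.0, high=60.0):
    """Three-zone reading of the 3D 'median' attenuation (HU).

    Below `low` suggests hamartoma, above `high` suggests carcinoid;
    the label for the zone in between is a hypothetical choice.
    """
    if median_hu < low:
        return "hamartoma"
    if median_hu > high:
        return "carcinoid"
    return "indeterminate"

# Values reported in the text for the 40 HU threshold on the training set.
lr_pos, lr_neg = likelihood_ratios(0.69, 0.86)

# The two example lesions of Fig. 4 (3D median of -15 HU and 71 HU).
example_hamartoma = classify_by_median_hu(-15)
example_carcinoid = classify_by_median_hu(71)
```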

These thresholds were then applied to the external validation dataset using the 3D ‘median’ attenuation feature and the mean attenuation measured on the 2D-ROI, which is easier to use in clinical practice. These results are summarized in Fig. 5 and detailed in Table 3. The positive and negative predictive values were calculated and corrected using the mean prevalence of hamartomas in the two centers (center one: 35%; center two: 18%; mean prevalence: 26%).
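
The prevalence correction follows from Bayes' theorem and can be sketched as below. The sensitivity and specificity used here are those reported for the 40 HU threshold on the training set and serve only as an illustration; the corrected values actually obtained on the validation set are those of Table 3 and are not reproduced by this example.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a given disease prevalence (all as fractions)."""
    tp = sensitivity * prevalence                  # true positive fraction
    fp = (1.0 - specificity) * (1.0 - prevalence)  # false positive fraction
    tn = specificity * (1.0 - prevalence)          # true negative fraction
    fn = (1.0 - sensitivity) * prevalence          # false negative fraction
    return tp / (tp + fp), tn / (tn + fn)

# Illustrative inputs: 69% sensitivity and 86% specificity for hamartoma,
# with the 26% mean prevalence of hamartomas reported for the two centers.
ppv, npv = predictive_values(0.69, 0.86, 0.26)
```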

Fig. 5

Results of the application of the different thresholds selected from the training set ROC curve to predict hamartomas or carcinoids on the external validation dataset, using the ‘median’ feature extracted from 3D segmentations and the 2D mean attenuation

Table 3 Confusion matrix for different thresholds on validation set with corrected prevalence

Reproducibility of the most important feature

The ICC of the 3D ‘median’ attenuation feature and of the 2D mean attenuation were 0.97 ([95% CI: 0.95–0.99]) and 0.90 ([95% CI: 0.85–0.94]), respectively. The Bland–Altman analysis showed that the 3D ‘median’ attenuation feature was more reproducible than the 2D mean attenuation (bias 3 ± 7 HU, LoA −10 to 16 HU, vs bias −0.7 ± 20 HU, LoA −40 to 40 HU) (Fig. 6).

Fig. 6

Bland–Altman plots of the difference in HU values measured by reader 1 and reader 2, with the 3D ‘median’ attenuation feature on the left and the 2D mean attenuation on the right. The ‘median’ feature is more reproducible than the 2D measure, with a smaller standard deviation of the bias, which is more compatible with the previously defined thresholds used to diagnose carcinoids and atypical hamartomas

The ICC of the 2D mean attenuation values measured at three different levels of the tumor was excellent (0.93, [95% CI: 0.90–0.96]). However, the LoA were wide relative to the scale of the thresholds previously described to predict hamartomas and carcinoids (Additional file 1: Figure e2).

There was no correlation between the pulmonary artery trunk enhancement and the 3D ‘median’ attenuation of carcinoids in the training set (r = 0.07), and no significant difference in the value of the 3D ‘median’ attenuation feature between chest CTs with a pulmonary trunk attenuation > 250 HU on 2D-ROIs and those with < 250 HU (p = 0.544) (Additional file 1: Figure e3).

Discussion

Radiomics made it possible to identify imaging features differentiating atypical lung hamartomas from carcinoids, with an AUC of 0.76. The ‘median’ attenuation (HU) was the most important feature in the model; on the training set, this feature alone reached an AUC of 0.85. The best thresholds to predict hamartomas and carcinoids on the external dataset were < 10 HU and > 60 HU, respectively. The 2D mean attenuation measured on circular ROIs gave good results, with a difference in sensitivity and specificity below 10% compared to the 3D ‘median’ attenuation feature. The 3D ‘median’ attenuation feature was slightly more reproducible than the 2D mean attenuation. In this case, the simple attenuation features outperformed the model and were more efficient.

Typical hamartomas combining fat, soft tissue and cartilaginous calcifications are easy to diagnose but infrequent [21], especially when small. Hamartomas do not require treatment, but they can easily be confused with carcinoid tumors when the radiological presentation is atypical, that is, without calcification or fat, leading to unnecessary surgery and potential complications. The clinical presentation can be the same: asymptomatic, cough, or pneumonitis. Semantic radiological criteria overlap for atypical hamartomas and carcinoid tumors [22, 23]. To add to the difficulty, carcinoid tumors may present with calcifications: Huang et al. reported a series of 21 cases [21] in which 12% of carcinoids contained calcifications, while other studies found calcifications in 33% or 40% [23, 24]. Another area of overlap concerns the traditional “central and endobronchial” description of carcinoids (as opposed to the “peripheral and extra bronchial” description of hamartomas), which seems to have been overestimated in the past, as shown in recent studies (down from 84% in older series to 47–57% in recent series) [23, 24, 25], probably owing to the increased use of CT. Presentation entirely inside the bronchial lumen is not the most frequent situation for carcinoids: 69% were not endobronchial in our study, similar to recent publications reporting 75% and 77% [23, 24]. As all these publications are retrospective and include small numbers of patients, the sensitivity and specificity of semantic signs are not reported.

PET/CT may not have additional value, as carcinoids and hamartomas can both present with either slightly increased or no 18FDG uptake [22]. The best tool seems to be 68Ga-DOTATOC, which achieves a detection rate of 88.4% with a threshold of SUVmax > 2.5 but has a high rate of false negatives [26].

Radiomics is a promising field for tumor characterization and has already proven its efficiency for carcinoids, to discriminate between different levels of Ki-67 expression or to identify metastatic disease [8]. Starting from an initial strategy based on machine learning using radiomics features, we identified a five-feature signature with good results on an external validation dataset. We wished to explore whether this signature could be simplified. The 3D ‘median’ attenuation feature alone performed better on the external validation dataset than the RF algorithm and the signature. This illustrates the limitation of signatures built on training sets, the meaning of which is often difficult to understand, and the challenge of finding a generalizable, robust and reproducible signature for clinical practice [27]. There are different methods to reduce and select radiomics features. Extraction with the open-source Pyradiomics tool is now widely used thanks to its availability [28]. The method chosen here to reduce features, which removes non-reproducible and redundant features, has already been published with good results [29], but there is currently no single accepted method.

Radiomics features depend on acquisition parameters, such as pixel size and kernel [30]. Multiple CT machines were used in our study, leading to heterogeneity in CT protocols. To limit this bias, we excluded patients with a slice thickness > 2 mm, and we used only smooth kernel reconstructions for segmentations. The use of an external dataset to validate the model and the other diagnostic features is a key point of the IBSI guidelines. Three-dimensional segmentations are more time consuming than 2D segmentations, but they gave results that seemed more reproducible for clinical use and removed the inter-slice variability of the 2D measure. Manual segmentation of tumors could introduce measurement bias, but we performed double independent segmentations and controlled for reproducibility according to imaging biomarker recommendations [31, 32].

This study has some limitations. We did not analyze all the hamartomas of the centers but only those that had been surgically removed. Although this led to a selection bias, this dataset represents hamartomas that are challenging in a clinical context, since they were not diagnosed pre-operatively. In addition, we tried to replace the 3D measure with a 2D measure that is simpler to implement in routine, but calculated the mean instead of the median, as it was the measure most frequently available on clinical PACS. The mean is influenced by extreme values whereas the median is not, but given the high number of pixels in each tumor, we hypothesized that the distribution of HU values within tumors would tend toward a Gaussian distribution. However, although performance was similar for 2D and 3D, the latter’s higher reproducibility makes it a more reliable potential biomarker. One might question the value of using radiomics in this study if a simple attenuation feature outperforms the model. However, radiomics was needed to select these features and to understand their importance in the machine learning model. It seems that the second-order features disturbed the machine learning process and lowered the performance of the model compared to the simple attenuation features.

In conclusion, an RF algorithm using radiomics features extracted from 3D segmentations could differentiate atypical hamartomas from carcinoid tumors in the lung on an external validation set with good performance (AUC = 0.76). Features based on HU accounted for 57% of the feature importance in the model. The 3D ‘median’ attenuation alone reached an AUC of 0.85 on the training set. We propose diagnostic thresholds of < 10 HU to confidently predict hamartomas and > 60 HU to confidently predict carcinoids with high specificity. The 3D ‘median’ attenuation was highly reproducible between two readers. The simpler 2D mean attenuation measurement was equally accurate but not reproducible enough between readers to be used.