Background

Radiomics is the high-throughput extraction and analysis of computationally derived features from medical images [1]. Several studies have shown promising results in predicting tumour phenotype and prognosis with the help of radiomics, particularly those using features derived from texture analysis that describe intratumoural heterogeneity [2]. However, since it extracts information from the entire tumour, accurate tumour delineation by an appropriate segmentation algorithm is an important step in measuring tumour specific image parameters. A key property of a segmentation algorithm is that it should delineate tumour volume to a degree of accuracy sufficient to preserve the radiomic signature of the lesion. Using an inaccurate segmentation algorithm can cause incorrect tumour delineation by including adjacent non-tumour structures or exclude significant tumour regions. Both advanced and basic texture features alike are thus incorrectly estimated, and this incorrect estimation can potentially cause errors in prediction of tumour biology and patient outcome.

Besides accuracy, reproducibility is also an essential property for image segmentation algorithms and includes reproducibility across different acquisitions and reconstructions [3,4,5,6], multiple observers [7], different segmentation algorithms [8], and different bin ranges [6]. Lack of reproducibility, even in a few parameters, can impact the validity of serial measurements over time (e.g. to determine response assessment) and measurements performed on separate acquisitions (e.g. in multicentre trials).

Multiple segmentation algorithms exist, e.g. freehand algorithms (FH), thresholding-based algorithms (e.g. fixed thresholding at 40% of maximum intensity cut-off [40P]), and algorithms based on probabilistic classification of voxels into tumour or background (e.g. fuzzy locally adaptive Bayesian [FLAB]) [9]. In measuring tumour volumes, FLAB has been shown to be more accurate than threshold-based segmentation, especially for small or heterogeneous lesions (≤17 mm diameter) [8, 9]. The better results are probably because FLAB incorporates both the spatial context and intensity while classifying voxels, unlike threshold-based segmentation, which classifies voxels based purely on their intensities relative to the intensity of the most intense voxel in the region.

Nevertheless, a study on oesophageal cancer has shown no clear advantage in using either thresholding or FLAB in predicting patient survival—a frequent end-point of radiomics research [10]. Furthermore, there are no data directly comparing FLAB, FH, and thresholding in terms of inter-observer reproducibility of derived parameters. We hypothesised that different segmentation algorithms (FH, 40P and FLAB), despite potentially being discrepant in tumour volume delineation, are not significantly different in the prognostic power of derived texture features. If this hypothesis holds true, then the segmentation algorithm of choice would be the one that is most reproducible. The aim of our study was to test inter-observer variation of texture features in 18F-fluorodeoxyglucose (FDG) PET images across the three segmentation algorithms in a cohort of patients with non-small cell lung cancer (NSCLC) and to determine the effect of the different segmentation algorithms on the derived texture features’ prognostic performance.

Methods

Patients

Fifty-three consecutive patients (mean age 65.8 years; 31 males) with NSCLC treated with chemoradiotherapy (64 Gy with concurrent vincristine–cisplatin or vincristine–carboplatin chemotherapy) between 2007 and 2009 were included. All patients had single tumours. Most patients had locally advanced (stage III) NSCLC and were inoperable. Overall survival (OS) was recorded from the date of the 18F-FDG PET scan and was defined as the time in months between the PET scan and the date of death. Patients who were alive were censored at the time of the last clinical follow-up. A waiver of institutional review board approval was obtained for this retrospective analysis of anonymised data.

Image acquisition and post-processing

18F-FDG PET/CT scans were performed at a median of 45 days (range 0–174 days) before treatment and all scans were acquired to the same protocol in the same institution on one of two scanners (Discovery VCT or DST, GE Healthcare, Chicago, USA) which are cross-calibrated to within 3% [11]. Patients fasted for at least 6 h before being injected with 350–400 MBq 18F-FDG intravenously. Ninety minutes (range 82–104 min) after tracer injection, PET images were acquired from the base of the skull to the upper thighs. Volumetric images were reconstructed using the ordered subset expectation maximisation algorithm with a slice thickness of 3.27 mm and pixel size of 4.7 mm. Low dose CT was acquired for attenuation correction at 120 kVp and 65 mAs without administration of oral or intravenous contrast agent. The reconstructed 18F-FDG PET datasets were imported into in-house texture analysis software implemented in MATLAB (Release 2013b, The MathWorks, Inc., Natick, Massachusetts, USA). Three readers, a radiation oncologist (GA), a radiologist (UB), and a nuclear medicine physician (GC), with 1, 8, and 25 years of  18F-FDG PET imaging  experience, independently drew freehand regions of interest around the metabolically active primary lung tumours on each axial slice on the 18F-FDG PET scans to generate a volume-of-interest (VOI). Care was taken to exclude adjacent metabolically active structures, e.g. the heart and lymph nodes. Each FH VOI served as the template for automatic segmentation algorithms, i.e. 40P and FLAB. Before applying the respective algorithms, the FH VOIs were expanded by 5 pixels in three-dimensions to cover the entire tumour and some background non-tumour surrounding regions. The expanded VOIs were inspected again to ensure that no adjacent metabolically active non-tumour tissue had inadvertently been included as a result of expansion.

Forty percent of maximum threshold

The 40P VOIs were derived from the expanded VOIs by retaining only voxels showing activity equal to or greater than 40% of the maximum activity voxel inside the VOI [12].

Fuzzy locally adaptive Bayesian

Using the FLAB algorithm [9], the expanded VOI voxels were categorised into three classes representing tumour core, region of partial volume averaging around tumour core, and background. Voxels assigned to the background class were discarded, and the remaining volume, i.e. tumour core and region of partial volume averaging, was kept as the final FLAB VOI in accordance with a previous description of this algorithm [9].

Texture parameters

The voxels of each VOI were resampled into 64 discrete bins of grey-scale values based on previous reports on optimum quantization schemes [6, 10]. The VOIs were then processed as three-dimensional matrices from which 83 texture parameters were derived: 6 model-based parameters (fractal) and 77 statistical parameters (20 first-order, 22 second-order and 35 higher-order).

Statistical analysis

Statistical analysis was conducted using R 3.1 [13]. For inter-observer variability, all three readers’ datasets were used. The remaining analyses were performed on a single dataset of the most experienced observer (GC). Q-Q plots were examined to detect skewed distributions. Twenty - seven out of the 83 derived texture parameters showed highly positively skewed distributions; these parameters were log-transformed (base 10).

Before comparing derived texture parameters between segmentation algorithms, we compared measured tumour volumes by the different algorithms, since volume provides the foundation for all subsequent analyses. However, comparison of scalar volumes is inadequate because two techniques can give identical volumes yet have measured different regions and thus be discordant. Therefore, we used the Jaccard similarity index (JSI) to obtain a voxel-by-voxel comparison between VOIs drawn with different segmentation algorithms. JSI computes agreement between two VOIs drawn with different segmentation algorithms on a voxel-by-voxel basis. When both VOIs are identical, the JSI is equal to one, and when they are discordant, the JSI is equal to zero. The JSI was multiplied by 100 to obtain percent -agreement.

We used FLAB as the reference set, based on results from phantom studies [9], and FH and 40P were used to derive 2 sets of JSI for all 53 cases—1 set for FH/FLAB and 1 for 40P/FLAB. The JSI of FH/FLAB set was compared with the JSI of the 40P/FLAB set using the Mann-Whitney U test. The effect of tumour size on percent-agreement was assessed visually using scatterplots.

The ICC was used to measure the agreement between the three readers for each derived texture parameter. This yielded three sets of 83 ICC values—1 set per segmentation algorithm. To rate reproducibility of a segmentation algorithm, arbitrary cut-offs were used to denote high (ICC >0.85), moderate (0.7–0.85) and low (<0.7) reproducibility. To compare ICC values derived from different segmentation algorithms, 95% confidence intervals (CI) of pair-wise differences in ICC values were calculated from the same data resampled 100 times, using a bootstrapping approach [14]. A difference in ICC between two segmentation algorithms was considered non-significant if the CI included zero.

To determine if a given segmentation algorithm preserved sufficient relevant tumour information, we chose a subgroup of 12 texture parameters for univariate Cox regression analysis: metabolically active tumour volume (MATV), total lesion glycolysis (TLG), SUVmean, SUVmax, SUV standard deviation, first-order entropy, grey-level co-occurrence matrix (GLCM) entropy, GLCM homogeneity, GLCM dissimilarity, grey-level size zone matrix (GLSZM) intensity variability, neighbourhood grey-tone difference matrix (NGTDM) coarseness and NGTDM contrast. We selected these variables after carefully reviewing relevant publications for their reported associations with patient survival [7, 10, 15,16,17]. The quality of each univariate cox regression model was assessed using two statistics: the Wald statistic was used to determine the statistical significance of the derived coefficient (cut-off p value: 0.05), whereas the Akaike information criterion (AIC) was used to determine goodness of fit of the model [18]. Our purpose was not to validate any of these texture parameters; we accepted them as valid biomarkers based on our literature review and merely used them to measure the performance of each segmentation algorithm. Thus, detailed survival analysis, i.e. analysis employing Kaplan-Meier curves and multivariate Cox regression, was not performed.

Results

Patient demographic and clinical characteristics are listed in Table 1.

Table 1 Patient demographics and clinical characteristics

Effect of segmentation algorithm on tumour-volume estimation

Overall, 40P yielded the smallest volumes and FH the largest (Fig. 1).

Fig. 1
figure 1

Boxplots comparing the three segmentation algorithms in volume measurement. The boxes represent the interquartile range (IQR). Horizontal lines through the boxes show median values. The whiskers represent values within 1.5*IQR. 40P = 40% of maximum intensity threshold. FH freehand, FLAB fuzzy locally adaptive Bayesian

Taking FLAB as the reference standard, mean JSI between FH and FLAB was 71.6% (range 48–87%; SD 9.4%) and between 40P and FLAB was 70.7% (range 10.5–98.1%; SD 21.1%). The difference between the two means was not statistically significant (p = 0.24). Furthermore, the percent-agreement of JSI did not appear to be related to tumour size (Fig. 2).

Fig. 2
figure 2

Lesion volumes computed with FLAB are plotted against JSI between FLAB and FH (a) and JSI between FLAB and 40P (b). Slope lines are shown along with 95% standard error of slope (dashed lines). r 2 values are displayed on the figures. The nearly straight slope lines and small r 2 imply that there is no particular trend to the degree of mismatch between FH and FLAB derived volumes over the range of tumour sizes. JSI Jaccard similarity index

Effect of segmentation algorithm on inter-observer reproducibility of derived texture parameters

Summary ICC statistics are as follows: FH (median ICC 0.77; IQR 0.7–0.85), FLAB (median ICC 0.83; IQR 0.77–0.86) and 40P (median ICC 0.9; IQR 0.85–0.92). Ranked on the basis of defined cut-offs, FH, FLAB and 40P showed high ICC (>0.85) in 20, 27 and 62 parameters, moderately high ICC (0.7–.85) in 40, 42 and 19, parameters and low ICC (<0.7) in 23, 13 and 2 parameters, respectively.

When ICC values of the segmentation algorithms were compared for individual texture parameters, the following observations were made: compared with FLAB, 40P showed greater ICC values for 77 of 83 parameters (statistically significant in 10 parameters); FLAB had higher ICC than 40P in the remaining 6 cases, 1 reaching statistical significance. Both FLAB and 40P had higher ICC than FH in 58 and 73 parameters, respectively, reaching statistical significance in 5 and 30 parameters, respectively. For SUV-range, FH had significantly greater ICC value than FLAB; compared with 40P, FH had significantly greater ICC in none. Group-wise comparison among texture parameters showed that the first-order histogram measures were the most reproducible for all segmentation algorithms (Fig. 3).

Fig. 3
figure 3

Boxplots comparing ICC values for the three segmentation algorithms over the four groups of texture parameters, i.e. first-order, second-order, and higher-order statistics, and model-based parameters. 40P 40% of maximum intensity threshold. FH freehand, FLAB fuzzy locally adaptive Bayesian, ICC intraclass correlation coefficient

ICC values of commonly reported texture parameters are given in Table 2.

Table 2 Comparison of ICC values of 11 commonly reported texture parameters derived with 3 contending segmentation algorithms

Effect of segmentation algorithm on survival prognostication

Patients were followed up for a median 21.2 months (range 2.1–51.1 months). Median OS was 25.6 months. On univariate Cox regression analysis, 40P found 2 out of 12 variables, i.e. first-order entropy and GLCM entropy, to be significantly associated with OS; FH and FLAB found 1, i.e. first-order entropy (Table 3). For each tested variable, survival models for all three segmentation algorithms had comparable AIC values with overlapping 95% CIs showing equivalent fit to the data (Table 3).

Table 3 Results of univariate cox proportional hazards done on 12 commonly reported texture parameters

Discussion

Our study found moderate to high reproducibility between observers, with 40P (median ICC 0.9) ranking highest, followed by FLAB (median ICC 0.83) and then FH (median ICC 0.77). Despite yielding smaller median volumes than FLAB and FH, and theoretically losing some texture information, 40P compared favourably with FLAB and FH by detecting a significant association between first-order entropy and survival. It found a second significant survival predictor (GLCM entropy) for which the other two segmentation algorithms were inconclusive.

For tumour volume measurements, we chose FLAB as proxy ground-truth in the absence of true measurements of resected specimens [10]. Although there is no consensus on segmentation algorithm suited for MATV delineation, a study has shown FLAB to be more accurate than a fixed threshold in tumour delineation over a range of phantoms (n = 6; sizes 10–37 mm) and simulated tumours (n = 3) [9]. There are no large studies comparing both techniques with resected specimen measurements. We found good voxel-by-voxel match (mean FLAB/FH and FLAB/40P match of 72 and 71%, respectively, p = 0.7)), uncorrelated with lesion size in our cohort of relatively large tumours. This means that in tumour sizes typically encountered, 40P- and FH-delineated volumes match FLAB-delineated volumes reasonably well and the degree of mismatch is stable over a range of tumour sizes. Segmentation with 40P generally estimated smaller tumour volumes. This is likely due to indiscriminate exclusion of voxel intensities below the fixed 40% threshold, such as those arising from low-activity tumour regions or tumour boundaries (subject to partial volume averaging with neighbouring tissue). FH, on the other hand, estimated larger tumour volumes. This is likely due to inclusion of some partial volume-averaged regions and some background region in physician - drawn contours.

In terms of inter-observer agreement, 40P ranked highest. We found FH to be inferior to other methods in inter-observer reproducibility, despite having a moderately good overall reproducibility (median ICC 0.86). The lower overall reproducibility of FH is due to its operator dependence, as opposed to the 40P and FLAB, which are semi-automatic. Other studies have also found FH to have moderate to good inter-observer variability [1, 19]. However, given that it is time-consuming and less reproducible than 40P and FLAB, we do not consider it the segmentation algorithm of choice.

We found a median ICC of 0.9 for 40P, which is similar to that reported in the literature [19]. 40P showed highest group-wise median ICC for all four groups of texture parameters, as illustrated in Fig. 3. Segmentation with 40P was especially robust in first-order statistical measures, for which it showed a median ICC of 0.94 compared with 0.86 for FH and 0.87 for FLAB. The reason for the higher reproducibility of 40P in first-order statistical measures is probably partly due to the inherent robustness of first-order statistical measures, as noted by others in CT [20] and PET studies [4, 7]. Furthermore, since 40P depends mainly on inclusion of the most active voxel in the region, and the calculation of the remaining volume is done automatically, it will not differ significantly between operators.

Comparing groups of texture parameters in terms of inter-observer reproducibility we found that all statistical measures were moderately to highly reproducible using any of the three segmentation algorithms, with first-order features ranking highest (median ICC for FH, FLAB and 40P: 0.86, 0.87 and 0.94, respectively). On the other hand, fractal dimension-related measures were least reproducible using FH (median ICC 0.64; IQR 0.47–0.74), FLAB (median ICC 0.74; IQR 0.73–0.85), or 40P (median ICC 0.77; IQR 0.75–0.88). While fractal dimension-related measures may have a role in lesion classification using CT [21], we did not find any reports supporting their role as prognostic biomarkers. Hence, these texture measures may not be useful in predicting prognosis.

Several texture analysis studies on patient survival have used freehand, threshold (40–50%) and FLAB algorithms [3, 7, 15, 22,23,24], highlighting different parameters in terms of their usefulness. However, these contending algorithms have not been compared in terms of effect on prognostic ability of derived texture parameters. In this regard, we found that 40P performed comparably to FH and FLAB in predicting overall survival when using first-order entropy. The fact that 40P discovered first-order entropy and an additional significant association between OS and a texture parameter (GLCM entropy), despite measuring generally smaller volumes, suggests that it preserves lesions’ radiomic signatures. Both first-order entropy and GLCM entropy have been shown in previous studies to have a potential role in predicting OS [7, 15, 16].

Our study has several potential limitations. First, the sample size is moderate but a larger sample may have revealed further prognostic associations. Nevertheless, as the main objective was to compare segmentation algorithms, the results remain informative. Second, we did not perform respiratory gating while performing 18F-FDG PET examinations of the lungs. Respiratory motion has been shown to add variability to measured texture parameters [25]. Third, we did not have histological ground-truth as a reference standard to compare the accuracy of volume delineation with the different algorithms. A few small studies using resected specimen measurements have found FLAB to be more accurate than fixed threshold [8]. Fourth, although our routine clinical protocol is to scan at 90 min post injection, many other departments scan at 60 min. It is possible that slightly different scan times may impact on segmentation volumes as there continues to be differential redistribution of FDG between benign and malignant tissue over time. Finally, we only tested the segmentation algorithms in an inter-observer reproducibility setting. Studies assessing test-retest reproducibility from different scanning sessions of various algorithms are necessary to validate best algorithms for multicentre trials and serial response-assessment examinations. It is possible that due to variation in maximum pixel intensity due to noise and reconstruction parameters in separate PET examinations, fixed thresholding may not be as reproducible as its contenders. While first-order statistical measures may not suffer significant differences in the test-retest setting [19], this potential shortcoming should be considered in radiomics research employing large numbers of higher-order variables.

Conclusions

Compared to FH and FLAB, 40P is a robust segmentation algorithm for 18F-FDG PET texture analysis in NSCLC in terms of inter-observer variability, and it also produces the highest number of texture parameters associated with patient survival. It is therefore considered a clinically acceptable segmentation algorithm for texture analysis in NSCLC.