Characterisation of malignant peripheral nerve sheath tumours in neurofibromatosis-1 using heterogeneity analysis of 18F-FDG PET

Purpose Measurement of heterogeneity in 18F-fluorodeoxyglucose (18F-FDG) positron emission tomography (PET) images is reported to improve tumour phenotyping and response assessment in a number of cancers. We aimed to determine whether measurements of 18F-FDG heterogeneity could improve differentiation of benign symptomatic neurofibromas from malignant peripheral nerve sheath tumours (MPNSTs).
Methods 18F-FDG PET data from a cohort of 54 patients (24 female, 30 male, mean age 35.1 years) with neurofibromatosis-1 (NF1), and clinically suspected malignant transformation of neurofibromas into MPNSTs, were included. Scans were performed to a standard clinical protocol at 1.5 and 4 h post-injection. Six first-order [including three standardised uptake value (SUV) parameters], four second-order (derived from grey-level co-occurrence matrices) and four high-order (derived from neighbourhood grey-tone difference matrices) statistical features were calculated from tumour volumes of interest. Each patient had histological verification or at least 5 years' clinical follow-up as the reference standard with regard to the characterisation of tumours as benign (n = 30) or malignant (n = 24).
Results There was a significant difference between benign and malignant tumours for all six first-order parameters (at 1.5 and 4 h; p < 0.0001), for second-order entropy (only at 4 h) and for all high-order features (at 1.5 h and 4 h, except contrast at 4 h; p < 0.0001–0.047). Similarly, the area under the receiver operating characteristic curves was high (0.669–0.997, p < 0.05) for the same features as well as 1.5-h second-order entropy. No first-, second- or high-order feature performed better than maximum SUV (SUVmax) at differentiating benign from malignant tumours.
Conclusions 18F-FDG uptake in MPNSTs is higher than in benign symptomatic neurofibromas, as defined by SUV parameters, and more heterogeneous, as defined by first- and high-order heterogeneity parameters. However, heterogeneity analysis does not improve on SUVmax discriminative performance.


Introduction
Neurofibromatosis-1 (NF1) is an inherited disease characterised by multiple neurofibromas in which there is an increased risk of malignant transformation to malignant peripheral nerve sheath tumours (MPNSTs) [1]. Non-invasive differentiation of benign symptomatic neurofibromas from those with malignant transformation is a clinical challenge. Standardised uptake value (SUV) or tumour-to-liver ratio measurements from 18F-fluorodeoxyglucose (18F-FDG) positron emission tomography (PET) have previously been described as an accurate method to detect MPNSTs in this patient group [2][3][4][5][6]. Qualitative scoring of heterogeneity of 18F-FDG PET on a three-point scale has also been described, in which MPNSTs displayed more heterogeneous tracer uptake, with discriminatory power similar to that of maximum SUV (SUVmax) [7].
There is increasing interest in the quantitative measurement of heterogeneity in medical images of cancer patients, including computed tomography (CT), magnetic resonance imaging (MRI) and PET. There is evidence that the use of heterogeneity parameters may improve characterisation, segmentation, prognostication and therapy response assessment compared to standard metrics such as size or lesion activity [8][9][10][11][12]. The most commonly used methods involve the measurement of statistically based parameters including first-, second- and high-order features. First-order features include global parameters such as SUV but also heterogeneity parameters, such as standard deviation (SD), first-order entropy and first-order uniformity. These are derived from intensity volume histograms of a tumour volume of interest (VOI) [8,10,12]. Second-order features, most often derived from grey-level co-occurrence matrices (GLCM), measure the relationship between pairs of voxels [13], and high-order features, most often derived from neighbourhood grey-tone difference matrices (NGTDM), measure the relationship between three or more voxels in the same or adjacent planes [14].
Our hypothesis was that quantitative heterogeneity parameters from 18F-FDG PET could improve differentiation of benign symptomatic neurofibromas from MPNSTs compared to standard PET metrics such as SUV. Our aim was to compare the discriminative ability of these parameters in a retrospective cohort of patients with NF1 whose tumours had been well characterised.

Patients and methods
A cohort of 54 consecutive patients with NF1 and clinical suspicion of malignant transformation of symptomatic neurofibromas, referred from our national neurofibromatosis service for 18F-FDG PET/CT scans, was identified. There were 30 male (mean age 34.7 years, range 12 to 73 years) and 24 female patients (mean age 35.5 years, range 9 to 86 years). An institutional review board waiver was obtained for retrospective analysis of these data. 18F-FDG PET/CT scans were all acquired to the same protocol in the same institution on one of two scanners (Discovery VCT or DST, GE Healthcare, Chicago, IL, USA) which were cross-calibrated to within 3% [15]. Patients were fasted for at least 6 h prior to administration of 350 (± 10%) MBq 18F-FDG (scaled to body weight/70 in paediatric patients), and scans were only acquired if the blood glucose measurement was less than 10 mmol/l. Scans were acquired according to the institutional standard clinical protocol for NF1 patients, with an acquisition at approximately 1.5 h (101.5 ± 15 min) from the upper thigh to the base of skull followed by an acquisition at approximately 4 h (251.7 ± 18.4 min) of the symptomatic tumour site only, all at 5 min per bed position [2]. Images were all reconstructed using an ordered subset expectation maximisation algorithm (2 iterations, 20 subsets) with a reconstructed slice thickness of 3.27 mm and pixel size 4.7 mm. The CT component of the scans was acquired at 120 kVp and 65 mAs without administration of oral or intravenous contrast agent.
The reconstructed PET datasets were imported into in-house texture analysis software implemented in MATLAB (Release 2016a, The MathWorks, Inc., Natick, MA, USA). Voxel intensities within the symptomatic tumour VOI were resampled to yield 64 discrete bins. Whilst most patients had multiple neurofibromas, only the symptomatic tumours were analysed. Since many of the tumours showed only low-grade FDG uptake, it was not possible to adequately segment the tumour regions directly from the PET data by freehand or by using semi-automated methods such as percentage threshold or fuzzy locally adaptive Bayesian methods [16]. Regions of interest were, therefore, drawn on the corresponding CT images, where tumours were more easily defined (Fig. 1), by an experienced operator with radiology and nuclear medicine training and over 20 years' experience. To assess interobserver variability, a random subset of 16 patients had VOIs defined on 1.5- and 4-h scans by a separate operator blinded to the initial observer measurements and clinical data.
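The resampling step and the histogram-based first-order heterogeneity parameters can be illustrated with a minimal Python sketch. This is not the in-house MATLAB software: the function names, the min-max scaling used to form the 64 bins and the use of the population SD are all assumptions made for illustration.

```python
import math
from collections import Counter
from statistics import pstdev


def quantise(values, n_bins=64):
    """Map VOI intensities onto integer grey levels 0..n_bins-1.

    Assumption: bins span the VOI minimum to maximum (min-max scaling).
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    scale = n_bins / (hi - lo)
    # The maximum voxel would land on n_bins, so clip it into the top bin.
    return [min(int((v - lo) * scale), n_bins - 1) for v in values]


def first_order_features(values, n_bins=64):
    """SD of the raw intensities plus histogram entropy and uniformity."""
    levels = quantise(values, n_bins)
    n = len(levels)
    probs = [count / n for count in Counter(levels).values()]
    return {
        "sd": pstdev(values),  # population SD (assumed convention)
        "entropy": sum(-p * math.log2(p) for p in probs),
        "uniformity": sum(p * p for p in probs),
    }


# Illustrative values only, not study data.
voi_suvs = [1.1, 1.3, 1.2, 2.8, 3.5, 1.0, 1.4, 4.2]
print(first_order_features(voi_suvs))
```

Here `values` is a flat list of the voxel intensities inside one VOI. A perfectly homogeneous VOI gives an entropy of 0 and a uniformity of 1, and entropy rises as the intensities spread across more bins.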
As well as SUVs (mean, maximum and peak, all normalised to body weight in kilogrammes), three first-order heterogeneity parameters (SD, entropy and uniformity), four second-order GLCM parameters (contrast, entropy, uniformity and homogeneity) and four high-order NGTDM parameters (coarseness, contrast, busyness and complexity) were calculated from the resulting VOIs. Second-order features were calculated from GLCMs measuring the grey-level distribution between pairs of voxels, and high-order features were derived from three-dimensional matrices taking into consideration neighbouring voxels in adjacent planes. All these features have been previously described in detail [13,14] and the chosen parameters have previously shown utility and/or robustness when used in clinical 18F-FDG PET data of cancers [17][18][19][20][21][22].
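To make the second- and high-order definitions concrete, the following Python sketch computes the four GLCM parameters and NGTDM coarseness for a small image. It is a simplified illustration rather than the method used in the study: it works on a single 2D slice instead of a three-dimensional volume, uses one pixel offset for the GLCM, and takes the squared-difference form of homogeneity, which is only one of several definitions in use. All function names are hypothetical.

```python
import math


def glcm(image, n_levels, offset=(0, 1)):
    """Symmetric, normalised co-occurrence matrix for one non-negative offset."""
    dr, dc = offset
    counts = [[0] * n_levels for _ in range(n_levels)]
    for r in range(len(image) - dr):
        for c in range(len(image[0]) - dc):
            a, b = image[r][c], image[r + dr][c + dc]
            counts[a][b] += 1
            counts[b][a] += 1  # count each pair in both directions
    total = sum(map(sum, counts))
    return [[n / total for n in row] for row in counts]


def glcm_features(p):
    """Contrast, entropy, uniformity and homogeneity of a normalised GLCM."""
    cells = [(i, j, v) for i, row in enumerate(p) for j, v in enumerate(row) if v > 0]
    return {
        "contrast": sum((i - j) ** 2 * v for i, j, v in cells),
        "entropy": sum(-v * math.log2(v) for _, _, v in cells),
        "uniformity": sum(v * v for _, _, v in cells),
        "homogeneity": sum(v / (1 + (i - j) ** 2) for i, j, v in cells),
    }


def ngtdm_coarseness(image, n_levels):
    """Coarseness from a 2D, 8-neighbour NGTDM (interior pixels only)."""
    s = [0.0] * n_levels  # summed |level - neighbourhood mean| per grey level
    n = [0] * n_levels    # number of interior pixels per grey level
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            window = sum(image[r + i][c + j] for i in (-1, 0, 1) for j in (-1, 0, 1))
            level = image[r][c]
            s[level] += abs(level - (window - level) / 8)
            n[level] += 1
    total = sum(n)
    # The small constant avoids division by zero for a perfectly flat image.
    return 1.0 / (1e-12 + sum(n_i / total * s_i for n_i, s_i in zip(n, s)))


# A 4 x 4 checkerboard of grey levels 0 and 1 (illustrative, not study data).
board = [[(r + c) % 2 for c in range(4)] for r in range(4)]
print(glcm_features(glcm(board, n_levels=2)))
print(ngtdm_coarseness(board, n_levels=2))
```

The input is expected to be already quantised to integer grey levels. Rapidly alternating patterns such as the checkerboard give high GLCM contrast and low coarseness, whereas a flat image gives zero contrast and a very large coarseness value.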
Statistical analysis was performed using SPSS (v22, Chicago, IL, USA) and MedCalc (v16.8.4, Ostend, Belgium) software. The data distributions were tested for normality using the Shapiro-Wilk test. As data were not normally distributed, differences between benign and malignant tumours were tested with the Mann-Whitney U test for each parameter and correlations between parameters with Spearman correlation. Receiver operating characteristic (ROC) curves were also used to compare the ability of each parameter to classify tumours as benign or malignant, and the areas under the ROC curves (AUROC) were calculated. Comparisons between AUROCs were made as described by DeLong et al. [23]. A separate assessment was made by combining SUVmax with other parameters that did not show a correlation with SUVmax. Statistical significance was assumed when p < 0.05. Inter-observer variation was assessed with intra-class correlation coefficients (ICCs).
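The link between the Mann-Whitney U test and the AUROC can be shown with a short Python sketch: the AUROC equals the U statistic divided by the number of benign-malignant pairs, that is, the probability that a randomly chosen malignant tumour has a higher parameter value than a randomly chosen benign one. The function name and the example values are hypothetical; the analysis in this study was performed in SPSS and MedCalc, and the DeLong comparison of AUROCs is not reproduced here.

```python
def auroc(benign, malignant):
    """AUROC as the fraction of (malignant, benign) pairs ranked correctly.

    Ties count as half a win, which makes the numerator the Mann-Whitney U
    statistic for the malignant group.
    """
    wins = sum(
        1.0 if m > b else 0.5 if m == b else 0.0
        for m in malignant
        for b in benign
    )
    return wins / (len(malignant) * len(benign))


# Illustrative SUVmax-like values only, not study data.
benign_values = [1.2, 1.8, 2.1, 2.5, 3.0]
malignant_values = [2.8, 4.5, 6.1, 7.3]
print(auroc(benign_values, malignant_values))
```

An AUROC of 1.0 means every malignant value exceeds every benign value, and 0.5 means the parameter ranks the two groups no better than chance. For a parameter that is lower in malignant tumours, the two arguments would be swapped.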

Results
Thirty patients had benign tumours and 24 had MPNSTs confirmed either histologically (n = 30) or by at least 5 years of follow-up (n = 24). Thirty-six symptomatic tumours were on the trunk and 18 in the extremities.
On 1.5-h scans, there was a significant difference between benign and malignant tumours for all SUV and other first-order parameters, for none of the second-order parameters and for all four high-order parameters. At 4 h, the results were the same, except that second-order entropy was significantly different and high-order contrast was not (Table 1). Of the percentage changes between the two time points, only those in SUVmean and SUVpeak differed significantly between benign and malignant lesions (Table 1). For ROC analysis, SUV and other first-order parameters, second-order entropy and all high-order parameters showed ability to discriminate at 1.5 and 4 h (except high-order contrast at 4 h; Table 2). SUVmax showed the highest AUROC at 1.5 h (0.992) and SUVpeak at 4 h (0.997), closely followed by SUVmax (0.996). SD showed the best discrimination among the other first-order features (0.967 and 0.99 at 1.5 and 4 h, respectively; Fig. 2). Coarseness showed the best discrimination among the high-order features (0.894 and 0.888 at 1.5 and 4 h, respectively; Table 2; Fig. 3). The percentage change in SUVmean and SUVpeak showed some discriminatory ability (AUROC 0.722 and 0.688, respectively; Table 2).
Most parameters showed significant correlations with SUVmax, except the GLCM parameters and NGTDM contrast. GLCM parameters performed poorly in discriminating tumours and so were not assessed further, but the combined parameter SUVmax/NGTDM contrast was evaluated to see whether the combination offered incremental value (Tables 1 and 2). Whilst combining the parameters in this way performed better than NGTDM contrast alone, it did not show any additional value over SUVmax.

Discussion
This study has shown that MPNSTs in patients with NF1 display greater heterogeneity of 18F-FDG uptake than benign symptomatic neurofibromas as measured by a number of global first-order features (including SD, entropy and uniformity) as well as local high-order features (including coarseness, contrast, busyness and complexity). To our knowledge, only qualitative measures of heterogeneity have previously been described in this scenario, where a qualitative heterogeneity score showed similar sensitivity to, but lower specificity than, SUVmax [7]. With regard to other primary soft tissue tumours, a previous study has shown that heterogeneity parameters from 18F-FDG PET can differentiate benign from malignant musculoskeletal tumours better than SUVmax (p = 0.004) [24]. Another study showed that heterogeneity of 18F-FDG uptake and tumour grade in sarcomas were the only independent prognostic factors predicting overall survival (p < 0.001 and 0.004, respectively), whereas SUVmax and tumour type were not [25]. It is hypothesised that increased heterogeneity of 18F-FDG uptake within tumours is related to variations in cell density and proliferation as well as more heterogeneous underlying biology, including angiogenesis and hypoxia, and that this is why heterogeneous tumours behave more aggressively [26,27].
[Table 2: Area under receiver operating characteristic curves (AUROC), sensitivity, specificity, PPV, NPV and accuracy at 1… Abbreviations: SUV standardised uptake value, SD standard deviation, GLCM grey-level co-occurrence matrix, NGTDM neighbourhood grey-tone difference matrix, PPV positive predictive value, NPV negative predictive value]
Our study also showed that MPNSTs had significantly higher 18F-FDG accumulation than benign neurofibromas as measured by SUV parameters, a finding that has been previously reported [2][3][4]. Whilst SUVmax showed excellent ability to discriminate MPNSTs from symptomatic benign neurofibromas as determined by AUROC (0.992 and 0.996 at 1.5 and 4 h, respectively), the SUVmax AUROC was not significantly different from those of SD, entropy or uniformity, but was significantly higher than those of all high-order features (Table 2; Figs. 2 and 3). The percentage change in SUV and heterogeneity parameters between 1.5- and 4-h scans did not show any superiority in discriminating benign from malignant tumours compared to the parameters alone.
[Fig. 2: … See Table 2 for AUROCs. There was no statistically significant difference between SUVmax AUROC and the other first-order parameter AUROCs (all p > 0.05)]
[Fig. 3: ROC curves for SUVmax and high-order parameters (coarseness, contrast, complexity, busyness) at 1.5 h. See Table 2 for AUROCs. There was a statistically significant difference between SUVmax AUROC and the other high-order parameter AUROCs (coarseness p = 0.019, contrast p < 0.0001, busyness p = 0.0009, complexity p = 0.0002)]
Our study is potentially limited by its retrospective nature, but our results should be representative as this was a cohort of patients referred for clinical assessment of symptomatic neurofibromas that were suspected of malignant transformation. However, it may not necessarily be possible to extrapolate the findings to other tumour types. Whilst semi-automated methods of tumour segmentation on 18F-FDG PET images are preferred and are likely to show even lower interobserver variation, we were unable to apply these methods due to difficulty in defining tumours with low uptake on the PET scans. Nevertheless, VOI definition from the CT images proved straightforward, with good inter-observer reproducibility. In addition, whilst all image sets were checked qualitatively for registration of the PET and CT data by an experienced observer, we cannot exclude small amounts of mis-registration due to patient movement.

Conclusion
In patients with NF1, MPNSTs showed greater heterogeneity and greater levels of 18F-FDG uptake than benign symptomatic neurofibromas. First-order heterogeneity parameters were as discriminative as SUVmax. Although high-order features also showed the ability to differentiate benign and malignant tumours, they had lower discriminatory ability than SUVmax.