Background

Neuroendocrine neoplasms (NENs) originate from neuroendocrine cells, with the primary tumor most often located in the lungs [1], gastrointestinal tract or pancreas [2]. The clinical course for patients diagnosed with NEN ranges from indolent to highly aggressive. NENs are often slow-growing and, because symptoms are vague, tumors are frequently diagnosed at a late stage with metastatic disease. Most NENs overexpress somatostatin receptors [3], which can be used for tumor imaging. Positron emission tomography (PET) for somatostatin receptor imaging (SRI) combined with anatomical imaging, e.g., computed tomography (CT), is essential for diagnosis, staging, treatment selection and follow-up of patients with NEN [4].

Previously, the maximal standardized uptake value (SUVmax) in SRI PET of patients with NEN has been shown to be prognostic for progression-free survival, but not overall survival [5,6,7,8]. The degree of somatostatin receptor expression determined by SUV is linked with tumor differentiation, i.e., fewer somatostatin receptors in the dedifferentiated, more aggressive tumors. The lesion with the lowest tracer uptake, rather than SUVmax, would therefore be expected to be better linked with prognosis. Indeed, we recently reported that by use of minimum lesion SUVmean and total tumor volume, the prognostic value of SRI with [64Cu]Cu-DOTATATE PET in patients with NEN could be greatly increased [9]. However, to obtain total tumor volume and minimum SUVmean, total tumor segmentation is required. Because disease in patients with NEN is often widely metastatic, this is time-consuming with a manual approach, which hinders clinical implementation. We therefore introduced a standardized semiautomatic method for total tumor segmentation that could be performed with a median time of approximately 20 min, depending on tumor burden. However, to increase the likelihood of clinical implementation of these prognostic measures, we believe further automation and faster tumor segmentation are needed. We therefore set out to employ a deep-learning approach for total tumor segmentation in [64Cu]Cu-DOTATATE PET of patients with NEN. One widespread method for image segmentation uses the U-Net architecture for the artificial intelligence (AI) network [10], which has been used, for example, for segmentation of bone metastases in breast cancer patients [11], of cervical tumors [12] and of gliomas [13] in PET/CT. To the best of our knowledge, however, it has not been applied to tumor segmentation in patients with NEN. Adapting a network of the U-Net architecture could therefore be a promising step toward further automation of tumor segmentation of NENs.

Hence, we aimed to implement and evaluate an AI model for tumor segmentation of NENs and determine if the performance of the model could prove useful for assisting or replacing our previously proposed standardized semiautomatic method for total tumor segmentation. Further, the model was implemented into clinically used software.

Methods

Patients

Dataset 1

From 2009 to 2013, we retrospectively included 127 patients with histopathologically confirmed NENs who underwent [64Cu]Cu-DOTATATE PET/CT performed 1 h after injection of approximately 200 MBq [64Cu]Cu-DOTATATE. All images were acquired with a Siemens Biograph 40 or 64 TruePoint PET/CT and reconstructed with the TrueX algorithm (Siemens Medical Solutions). Settings were as follows: 3 iterations, 21 subsets, 2 mm Gaussian post-filter smoothing, 336 × 336 matrices of 2 × 2 × 3 mm³ voxels. CT-based attenuation correction was applied. A diagnostic quality CT scan with iodine intravenous contrast was performed before the PET. To ensure quantitatively accurate measurements across the different PET/CT scanners, we perform quality control every 2 weeks to verify that they are calibrated to measure within our acceptance range (5%). We excluded patients with no signs of NEN due to previous radical surgery. For included patients, age, gender, site of primary tumor, Ki67 index and WHO grade (Grade 1 (Ki67 < 3%), Grade 2 (Ki67 3–20%) and Grade 3 (Ki67 > 20%)) were recorded.

Dataset 2

From 2018 to 2019, we retrospectively included 31 patients with histopathologically confirmed NEN who underwent [64Cu]Cu-DOTATATE PET/CT. All PET/CT scans were acquired with a Siemens Biograph 128 mCT using the same setup and reconstruction routine as described above.

Dataset 3

Also from 2018 to 2019, we retrospectively included 10 patients with known NEN who were referred for [64Cu]Cu-DOTATATE PET/CT but had no signs of NEN on PET/CT after radical treatment and therefore served as negative controls in this study. Images were acquired as described for dataset 2.

Approval was obtained from the Danish Patient Safety Authority (31-1521-453), and the requirement for informed consent was waived for the included patients.

Ground truth label methodology

Ground truth labels were created using a previously described standardized procedure for segmentation of NENs in [64Cu]Cu-DOTATATE PET/CT images [9]. In brief, a region of interest (ROI) is drawn within the normal liver on the PET image. The SUVmean value of this ROI is then used to calculate a threshold value for the rest of the image. The threshold value is given as

$$Threshold = \left( {1.5 \cdot SUV_{mean} } \right) + \left( {2 \cdot SD} \right)$$

with SD being the standard deviation within the ROI. In case of complete metastatic liver involvement the ROI is drawn in the spleen. The threshold formula based on normal spleen uptake has been adapted to the higher physiological uptake in the spleen and is given as

$$Threshold = \left( {0.67 \cdot SUV_{mean} } \right) + \left( {2 \cdot SD} \right)$$

Consequently, every voxel in the PET image with a value larger than the threshold is segmented by default. To reduce segmentation due to noise, all volumes < 0.1 mL (< 9 voxels) are automatically removed. Finally, a physician (E.A.C.) manually corrected the segmentation by removing volumes with high physiological uptake (typically bladder, adrenal glands and kidney) and false-positive segmentations, yielding the ground truth label.
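The thresholding and small-volume removal described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the software used in the study: the function names, the nested-list image layout `pet[z][y][x]`, the use of the population standard deviation, and the 6-connectivity used to define a connected volume are all hypothetical choices.

```python
import math
from collections import deque
from statistics import mean, pstdev

# Factors taken from the two threshold formulas in the text.
ORGAN_FACTOR = {"liver": 1.5, "spleen": 0.67}
# Assumption: voxels belong to the same volume if they share a face.
NEIGHBOURS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))


def reference_threshold(roi_suv, organ="liver"):
    """Threshold = factor * SUVmean + 2 * SD of the reference ROI."""
    # Assumption: SD is the population standard deviation of the ROI voxels.
    return ORGAN_FACTOR[organ] * mean(roi_suv) + 2 * pstdev(roi_suv)


def min_voxel_count(min_volume_ml=0.1, voxel_mm=(2.0, 2.0, 3.0)):
    """Smallest number of voxels whose volume reaches min_volume_ml."""
    voxel_ml = voxel_mm[0] * voxel_mm[1] * voxel_mm[2] / 1000.0  # mm^3 -> mL
    return math.ceil(min_volume_ml / voxel_ml)


def threshold_segmentation(pet, threshold, min_size):
    """Binary mask of voxels above threshold, without volumes below min_size."""
    nz, ny, nx = len(pet), len(pet[0]), len(pet[0][0])
    mask = [[[pet[z][y][x] > threshold for x in range(nx)]
             for y in range(ny)] for z in range(nz)]
    seen = set()
    for z in range(nz):
        for y in range(ny):
            for x in range(nx):
                if not mask[z][y][x] or (z, y, x) in seen:
                    continue
                # Breadth-first search collects one connected volume.
                component, queue = [], deque([(z, y, x)])
                seen.add((z, y, x))
                while queue:
                    cz, cy, cx = queue.popleft()
                    component.append((cz, cy, cx))
                    for dz, dy, dx in NEIGHBOURS:
                        n = (cz + dz, cy + dy, cx + dx)
                        inside = 0 <= n[0] < nz and 0 <= n[1] < ny and 0 <= n[2] < nx
                        if inside and n not in seen and mask[n[0]][n[1]][n[2]]:
                            seen.add(n)
                            queue.append(n)
                if len(component) < min_size:
                    for cz, cy, cx in component:
                        mask[cz][cy][cx] = False
    return mask
```

With 2 × 2 × 3 mm voxels (0.012 mL each), `min_voxel_count()` gives 9, which matches the "< 9 voxels" criterion in the text. The physician's manual correction step is not modeled here.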

Convolutional neural network

We applied the U-Net architecture, which involves a contracting (down-sampling) convolutional path for extraction of features in the images and an expanding (up-convolutional) path for localization of the features (Fig. 1). The well-established nnU-Net framework was used for automatically setting up the U-Net, including preprocessing, training, inference and post-processing of data [14]. We used default nnU-Net settings for training the network with PET and CT images as input and our ground truth label as target. Details regarding the preprocessing, network architecture and hyperparameters are shown in Supplementary Table 1.

Fig. 1 Total tumor segmentation of neuroendocrine neoplasms by a convolutional neural network

Training and validation of the network

For training and validation, a randomly selected subset of 117 [64Cu]Cu-DOTATATE PET/CTs from dataset 1 was used (Fig. 2). Cross-fold training was performed in accordance with the nnU-Net procedure. Training yields one model per fold, each of which predicts for every pixel a probability of being segmented. An ensemble of the models was created by averaging these probabilities before each pixel was assigned as segmented or not in the ensemble model. Post-processing with removal of any segmentation outside the patient’s body was performed using the PET signal to automatically outline the body contour. Furthermore, segmentations < 0.1 mL (< 9 voxels) were automatically removed.
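The ensembling step can be illustrated with a short sketch. It assumes each fold model outputs a flat list of foreground probabilities, one per pixel, and that a pixel is segmented when the averaged probability exceeds 0.5; the function name, the data layout and the cutoff are assumptions for illustration and are not taken from the nnU-Net code.

```python
def ensemble_mask(fold_probabilities, cutoff=0.5):
    """Average per-pixel foreground probabilities over folds, then binarise.

    fold_probabilities: one equally long list of probabilities per fold model.
    cutoff: assumed decision boundary for the averaged probability.
    """
    n_folds = len(fold_probabilities)
    # zip(*...) groups the predictions of all folds for the same pixel.
    averaged = [sum(values) / n_folds for values in zip(*fold_probabilities)]
    return [probability > cutoff for probability in averaged]
```

For three folds predicting `[0.9, 0.2, 0.6]`, `[0.7, 0.4, 0.3]` and `[0.8, 0.1, 0.5]`, only the first pixel is segmented: a single confident fold (0.6 for the third pixel) is outvoted by the average.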

Fig. 2 Illustration of data used for training and test. The models and the ensemble hereof were inferred on the test cohort. Boxes are not drawn to scale

Testing of models

The remaining 10 patients from dataset 1 (not used in training) and 31 patients from dataset 2 constituted the test cohort. Patients from both dataset 1 and 2 were used for testing to control for possible effect of different PET/CT systems used. The automatically post-processed models were evaluated by the following metrics in the test cohort.

$$Sensitivity = \frac{TP}{{TP + FN}}$$
$$Precision = \frac{TP}{{TP + FP}}$$
$$Dice = \frac{2 \cdot TP}{{2 \cdot TP + FP + FN}}$$

where TP, FP, TN and FN denote true positives, false positives, true negatives and false negatives, respectively. These metrics were used to select which model should be further evaluated. The selected output was reformatted into DICOM RT structures and imported into the Mirada DBx 1.2.0 software package (Mirada Medical Ltd., Oxford, UK). An arbitrary scoring scheme (Table 1) focusing on the number of false positive and false negative lesions was applied to evaluate the automatically post-processed AI segmentation in the test cohort (n = 41). Time spent on any necessary manual correction was recorded in a randomly selected subgroup (n = 10), blinded to, and compared with, the time spent on ground truth labeling. Scoring of the AI segmentations in the test cohort was performed by a physician (E.A.C.). Additionally, the AI model was tested on patients from dataset 3, i.e., patients without any signs of NEN on PET/CT due to radical surgery. These patients were judged by the number of segmentations (if any) and the volume of the segmentation.
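As an illustration of the three metrics, the sketch below computes them pixel-wise from two binary masks given as flat sequences. The function name and the convention of returning `None` when a denominator is zero (for example, a scan with no lesion in either mask) are assumptions; lesion-wise scoring would require matching segmented lesions first and is not shown.

```python
def overlap_scores(prediction, truth):
    """Sensitivity, precision and Dice for two equally long binary masks."""
    pairs = [(bool(p), bool(t)) for p, t in zip(prediction, truth)]
    tp = sum(p and t for p, t in pairs)      # segmented in both masks
    fp = sum(p and not t for p, t in pairs)  # segmented by the model only
    fn = sum(t and not p for p, t in pairs)  # present in the ground truth only

    def ratio(numerator, denominator):
        # Assumption: an undefined ratio is reported as None.
        return numerator / denominator if denominator else None

    return {
        "sensitivity": ratio(tp, tp + fn),
        "precision": ratio(tp, tp + fp),
        "dice": ratio(2 * tp, 2 * tp + fp + fn),
    }
```

For example, a prediction `[1, 1, 1, 1, 0, 0]` against a ground truth `[1, 1, 0, 0, 1, 0]` has TP = 2, FP = 2 and FN = 1, giving a sensitivity of 2/3, a precision of 1/2 and a Dice score of 4/7.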

Table 1 Evaluation scheme of AI segmentations

Statistical analyses

Data are reported as mean and standard deviation unless otherwise indicated. To assess difference between groups we used chi-squared test for categorical data and unpaired and paired t-test for continuous data, as appropriate. A P-value < 0.05 was considered statistically significant. Statistical analyses were performed in R version 3.6.0 (R Foundation for Statistical Computing).

Results

Patients

The datasets for training and testing the models were similar with regard to age, gender and site of primary tumor. Patients most often had small intestinal primary tumors. Patients in the test cohort were more often classified as WHO Grade 2, but the Ki67 index did not differ significantly (Table 2). All but one patient in the test cohort had multiple lesions, with a median of 25 lesions (interquartile range: 41).

Table 2 Demographic data for patients with neuroendocrine neoplasms

Output from nnU-Net

The models, as well as the ensemble of the models, were evaluated using the defined metrics in the test cohort (n = 41). Since no single model performed best with regard to both pixel-wise and lesion-wise metrics (Table 3), the ensemble was chosen for evaluation of FP/FN lesions. Because the ensemble is a combination of the different models, it captures more of the variation in the dataset and should be the more robust choice for new cases.

Table 3 Metrics for AI segmentations without manual adjustments applied to the test cohort (n = 41)

Evaluation of AI segmentations

All 41 patients in the test cohort were evaluated and scored according to the scoring scheme (Table 1), with the detailed description reported in Supplementary Table 2. Examples of two patients are shown in Fig. 3. In 7/41 (17%) patients, the AI segmentation required no manual adjustments to obtain the final total tumor segmentation (Table 4). In 35/41 (85%) patients, the AI segmentation was considered acceptable with either no or minor/major adjustments before the final total tumor segmentation was obtained. The time spent on any manual correction was compared between the output from the AI segmentation and from the standardized semiautomatic method used as ground truth. In the subgroup of 10 randomly selected patients, less time was needed to correct the segmentation when using the segmentation from the AI model (median 5 min) than when using the semiautomatic method (median 17 min), p < 0.01 (Fig. 4). Further, the ensemble model was applied to the negative controls: in 5/10 (50%) cases no segmentations were seen, and in 9/10 (90%) cases the volume segmented was ≤ 1.5 mL (e.g., physiological uptake in the adrenal gland). The median segmented volume was 0.15 mL. In one negative control patient, 56 volumes were segmented with a total volume of 68.6 mL, with almost all of the segmentations placed in normal liver tissue. No explanation for the false positive segmentations could be identified when reevaluating that particular patient.

Fig. 3 Representative examples of the segmentations from the AI model for two patients. Maximum intensity projection [64Cu]Cu-DOTATATE PET without tumor segmentation (A, D). Ground truth segmentation of tumor (B, E). AI-predicted segmentations with no manual adjustments performed (C, F). In the AI output, all separate segmentations are given a unique color, e.g., red, blue, green, making manual adjustment with deletion of a segmentation easy and fast (e.g., part of the bladder was erroneously segmented in C)

Table 4 Evaluation of number of false-positive/false-negative segmentations by AI without manual adjustments

Fig. 4 Boxplot depicting time spent on manual correction to obtain the final total tumor segmentation

Discussion

Our most important finding was that time spent on the laborious task of total tumor segmentation in patients with NEN can be reduced from approximately 20 min to 5 min by applying a convolutional neural network. The majority of the segmentations performed automatically by the network were useable for obtaining total tumor segmentation with either no or only a minor degree of manual adjustment. Hence, the method proposed in our study serves to further automate the process of total tumor segmentation. Recently, we and others have reported the prognostic power of total tumor volume determined in SRI PET [9, 15,16,17,18,19]. Furthermore, with total tumor segmentation, several metrics may be derived in the setting of multiple lesions, e.g., the lesion with the lowest SUVmean can easily be determined. Underscoring the value of total tumor segmentation, we recently reported the prognostic value of a combined approach of total tumor volume and minimum SUVmean [9]. The reduced time needed for performing total tumor segmentation presented in the current study may further improve clinical adoption and feasibility in the clinical routine.

In our training cohort, we trained the nnU-Net convolutional neural network. In the test cohort, our proposed AI model without manual adjustments had pixel-wise and lesion-wise Dice scores of 0.801 and 0.850, respectively. This is comparable with other proposed models for tumor segmentation [13, 20, 21]. More importantly, by implementing the segmentation in a clinical imaging viewer, the segmentations could be judged, adjusted and verified by a physician. Of note, because the output is reformatted to DICOM RT structures, it could easily be integrated into other clinical imaging viewers. The evaluation showed that the largest share (46%) of the AI segmentations of the 41 patients was scored as “Acceptable with minor adjustments.” Only 15% were scored as non-usable due to the number of manual adjustments needed, leaving 85% of the segmentations as useable for obtaining total tumor segmentation. Also, in the negative control cases the model performed well, segmenting a volume of ≤ 1.5 mL in 90% and producing no segmentations in 50% of the 10 cases. From both the test cohort and the negative controls, however, it is evident that the proposed model cannot stand alone. Manual adjustments are necessary in some cases, and the AI segmentations should always be examined by a physician. Compared with the previously proposed methodology for a standardized semiautomatic segmentation, the AI model serves as an improved starting point for the physician, and its output may be corrected in 5 min as compared to the previously reported 20 min [9]. The AI-predicted segmentations should be viewed as an aid for performing total tumor segmentation, which, however, is ultimately determined by the physician performing the adjudication. The AI-predicted segmentations should not give rise to altered clinical staging of the patients as compared to a manual total tumor segmentation approach, unless true positive lesions were overlooked by a manual approach.
The manual adjustments may involve removing false positive segmentations, e.g., the adrenal gland or bladder, and/or adding lesion(s) not included in the segmentation. Because each segmented volume has a unique label, entire false positive segmentations are easily removed. The manual addition of single lesions can be done using several approaches, e.g., a seed-point method, where a seed is placed in the lesion and an SUV threshold is used to determine the outline of the lesion. The threshold may be based on the standardized method we previously proposed, where the patient-specific threshold is based on tracer uptake in the normal liver. Other methods include a fixed threshold or an adaptive threshold (e.g., 40% SUVmax).
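A seed-point addition of a single lesion can be sketched as a simple region-growing routine. This is a hypothetical illustration and not the implementation in the clinical software: the function name, the `pet[z][y][x]` nested-list layout and the face-connected neighbourhood are assumptions, and the threshold is passed in so that either a liver-based or a relative (e.g., 40% of SUVmax) cutoff can be used.

```python
def grow_from_seed(pet, seed, cutoff):
    """Return the set of (z, y, x) voxels connected to seed with SUV >= cutoff."""
    nz, ny, nx = len(pet), len(pet[0]), len(pet[0][0])
    # Assumption: voxels are connected when they share a face.
    steps = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))
    lesion, stack = {seed}, [seed]
    while stack:
        z, y, x = stack.pop()
        for dz, dy, dx in steps:
            n = (z + dz, y + dy, x + dx)
            if n in lesion:
                continue
            if not (0 <= n[0] < nz and 0 <= n[1] < ny and 0 <= n[2] < nx):
                continue
            if pet[n[0]][n[1]][n[2]] >= cutoff:
                lesion.add(n)
                stack.append(n)
    return lesion
```

As a usage example, if the seed is placed on the hottest voxel of the lesion, `grow_from_seed(pet, seed, 0.4 * pet[z][y][x])` approximates a 40% SUVmax outline, whereas passing the liver-based threshold reproduces the patient-specific variant.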

As the model is derived from a training cohort of 117 patients, and given the large variation in primary tumor site and sites of metastatic lesions, the model may be further improved by using larger training cohorts. By implementing the model for segmentation, new training cohorts may be more easily generated given the reduction in time spent on performing tumor segmentation. Ideally, the model should also be validated in an external cohort. [64Cu]Cu-DOTATATE was developed and clinically implemented at our department at Rigshospitalet, Denmark, but was not available at other sites until the recent Food and Drug Administration approval [22]. It is therefore likely that the proposed model could be externally validated in the future. The trained AI model may not be directly applicable to a setting with another PET tracer for SRI, e.g., [68Ga]Ga-DOTATOC, due to variations in uptake patterns in both normal tissues and lesions [23]. Some issues should be addressed regarding the proposed AI model. For both the ground truth segmentation and the post-processing of the model output, all segmentations < 0.1 mL were automatically removed. With a voxel size of approximately 2 × 2 × 3 mm³ = 0.012 mL, the deletion of all lesions below 0.1 mL corresponds to lesions below 9 voxels. This was done to reduce segmentation of noise. Furthermore, in the training and test cohorts small intestinal and pancreatic primary tumors were most frequent; hence, the AI model may perform best in such cases.

Conclusion

We implemented and validated an AI method that achieved high concordance with the ground truth label and resulted in much faster tumor segmentation. In the majority of patients, the AI segmentation was useable with either no or minor manual adjustments required. By applying this approach to total tumor segmentation, prognostication by quantitative measures of [64Cu]Cu-DOTATATE PET may become more feasible and be implemented in the clinical routine.