Automated lung cancer assessment on 18F-PET/CT using Retina U-Net and anatomical region segmentation

Objectives To develop and test a Retina U-Net algorithm for the detection of primary lung tumors and associated metastases of all stages on FDG-PET/CT.

Methods A data set consisting of 364 FDG-PET/CTs of patients with histologically confirmed lung cancer was used for algorithm development and internal testing. The data set comprised tumors of all stages. All lung tumors (T), lymphatic metastases (N), and distant metastases (M) were manually segmented as 3D volumes using whole-body PET/CT series. The data set was split into a training (n = 216), validation (n = 74), and internal test data set (n = 74). Detection performance for all lesion types at multiple classifier thresholds was evaluated and false-positive findings per case (FP/c) were calculated. Next, detected lesions were assigned to the categories T, N, or M using an automated anatomical region segmentation. Furthermore, reasons for FPs were visually assessed and analyzed. Finally, performance was tested on 20 PET/CTs from another institution.

Results Sensitivity for T lesions was 86.2% (95% CI: 77.2–92.7) at a FP/c of 2.0 on the internal test set. The anatomical correlate of most FPs was the physiological activity of bone marrow (16.8%). TNM categorization based on the anatomical region approach was correct in 94.3% of lesions. Performance on the external test set confirmed the good performance of the algorithm (overall detection rate = 88.8% (95% CI: 82.5–93.5%); FP/c = 2.7).

Conclusions Retina U-Nets are a valuable tool for tumor detection tasks on PET/CT and can form the backbone of reading assistance tools in this field. FPs have anatomical correlates that can lead the way to further algorithm improvements. The code is publicly available.

Key Points
• Detection of malignant lesions in PET/CT with Retina U-Net is feasible.
• All false-positive findings had anatomical correlates, physiological bone marrow activity being the most prevalent.
• Retina U-Nets can form the backbone of tools assisting imaging professionals in lung tumor staging.

Supplementary Information The online version contains supplementary material available at 10.1007/s00330-022-09332-y.


Introduction
It is well recognized that the positron emission tomography (PET) component of PET/CT with 18F-fluorodeoxyglucose (FDG) as radiotracer carries important metabolic information for tumor characterization and staging of lung cancer [1,2]. Consequently, FDG-PET/CT has become the standard for the diagnostic workup of patients with lung cancer [3]. Yet despite advancing technologies, imaging-based staging remains a challenging task [4], and both underdiagnosis and overdiagnosis are common problems [5]. Appropriate treatment depends on correct staging [6]. Given the fast progression of the disease [7], initial mis-staging has a particularly negative impact on patients' quality of life. Automated lung tumor detection using artificial intelligence (AI) algorithms has shown promising results on high-resolution CT [8,9]. However, detection of advanced tumors in particular is challenging for algorithms on the CT component of PET/CTs, which is acquired with a slice thickness of 3 mm in free-breathing technique, and is currently not accurate enough [10]. Given this and the hybrid nature of PET/CT, it is advisable to use both the PET and CT information when developing AI algorithms. So far, only a few researchers have done so, and all focused solely on lung lesions (T) or lung nodules [11][12][13][14]. Further developments towards 3D detection of malignant lesions of different kinds (tumor, lymph node metastases, distant metastases) within the complete scan volume are needed to arrive at a clinically relevant, comprehensive solution assisting radiologists in TNM staging.
Recently, P. Jaeger and colleagues presented an application based on Retina U-Net for end-to-end object detection on medical data. It showed good results for the detection of breast cancer on MRI in both 2D and 3D [15], with performance superior to other approaches such as Mask R-CNN. Here, for the first time, this approach is adapted, trained, and tested on a clinical data set of 364 PET/CTs of patients with histologically confirmed lung cancer of all stages. Furthermore, the algorithm is combined with an anatomical region segmentation to classify detected lesions into the T, N, and M categories. We hypothesize that Retina U-Net is an effective tool for the detection of T, N, and M lesions on PET/CT.

Material and methods
Written informed consent for this retrospective study was waived by the regional ethics committee (Ethikkommission Nordwest-und Zentralschweiz).

Case selection: internal data sets
We identified FDG-PET/CTs of patients with histologically proven primary lung cancer acquired at our institution between 01/2010 and 12/2016. Selection criteria were protocol name, time period, and verified tumor histology according to the pathology archive. This resulted in 364 FDG-PET/CTs. Figure 1 shows the study workflow.

Imaging protocols and reporting
PET/CT scans were performed on two integrated PET/CT systems: a Discovery STE with 16-slice CT (GE Healthcare) and a Biograph mCT-X RT Pro Edition with 128-slice CT (Siemens Healthineers). Scans were obtained 1 h after intravenous injection of 5 MBq FDG/kg body weight at glycemic levels below 10 mmol/L and previous fasting for at least 6 h. Technical details are provided in Supplement A. The clinical PET/CT reports had been created by residents in nuclear medicine and reviewed and signed by a board-certified radiologist and nuclear medicine physician in consensus.

TNM annotation and image segmentation
After de-identification, the PET/CT of each patient was opened in a locally installed segmentation software (3D-Slicer, version 4.6.2). The TNM classification differentiates four main T categories (T1-T4, depending on size and features such as invasiveness), three N categories (N1-N3, depending on location), and one M category [16]. As shown in Table 1, this study used the official TNM classification, slightly simplified by leaving out the sub-subcategories (e.g., T1a, T1b). Annotation and 3D image segmentation with reference to the anonymized written PET/CT report were performed by two readers (R1 and R2).

General algorithm characteristics: Retina U-Net
Retina U-Net is a state-of-the-art approach in medical object detection [15]. It recaptures pixel-wise training signals from segmentation supervision. The architecture used in this study is characterized by additional branches in the lower decoder levels for end-to-end object classification and bounding box regression (Fig. 2). Figure 3 shows the schematic data processing workflow. For the first time, the algorithm has been adapted and trained to detect T, N, and M lesions on PET/CTs.

Training, validation, and testing
The complete internal data set was randomly split into a training (60%), validation (20%), and internal test set (20%). Intensity values were clipped at [−600, 1200], rescaled to [0, 1], and z-score normalized. Images were cropped on the z-axis to slices containing lung tissue according to a lung segmentation based on intensity thresholding and connected component analysis. As the task of the algorithm is object detection, bounding boxes were created from the manual 3D segmentations for training. Two other approaches including only T or only T&N lesions were also assessed, but performed worse and were not pursued further (see Supplement B). The intersection-over-union threshold was set to 0.1. Due to limitations in graphics processing unit (GPU) memory, 3D images were processed patch-wise. During training, patches containing foreground objects were oversampled to ensure the balance between foreground and background patches. Extensive data augmentation in 3D was applied to counteract overfitting. At test time, the algorithm was applied to overlapping patches over the entire image and the resulting predictions were consolidated. Detection rates (= sensitivity) and FP rates per case were calculated. Table 2 provides information on important algorithm parameters. Supplement C provides further technical details.

[Fig. 2 caption: The encoder-decoder structure resembles a U-Net. This segmentation model is complemented by a detection network for classification (cl) and bounding box regression (bb) operating on the lower (coarser) levels of the architecture so as to exploit object-level features. The green arrows indicate the high-quality training signals being backpropagated from an auxiliary segmentation task. From [15].]
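The intensity preprocessing described above (clipping, rescaling to [0, 1], z-score normalization) can be sketched as follows. This is a minimal illustration of the stated steps; the function name and the exact order of operations are our assumptions, not the authors' published implementation:

```python
import numpy as np

def preprocess_ct(volume, clip_min=-600.0, clip_max=1200.0):
    """Clip CT intensities to the stated window, rescale to [0, 1],
    then z-score normalize. Illustrative sketch only."""
    v = np.clip(volume.astype(np.float32), clip_min, clip_max)
    v = (v - clip_min) / (clip_max - clip_min)   # rescale to [0, 1]
    return (v - v.mean()) / (v.std() + 1e-8)     # z-score normalize

# Toy 2 x 2 "volume" in Hounsfield units
vol = np.array([[-1000.0, 0.0], [600.0, 2000.0]])
out = preprocess_ct(vol)
```

Values outside the clip window (here −1000 and 2000 HU) are saturated before rescaling, so the normalized output preserves the ordering of in-window intensities while having zero mean and unit variance.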

TNM categorization
In a subsequent step, all test cases were processed with a publicly available deep-learning lung segmentation algorithm with excellent performance (DICE score: 0.98 ± 0.03) [17], and the following five anatomical regions were defined: (1) lung region (= the segmentation mask resulting from the algorithm); (2) mediastinum (the area between the two lung masks). Attribution accuracy was evaluated on the internal and external test set with the original annotation labels as ground truth.
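The region-based attribution can be illustrated with a toy rule that assigns a detected lesion to T, N, or M according to the anatomical region containing its centroid. The integer region labels, the centroid lookup, and the decision rule below are simplified assumptions for illustration, not the paper's exact implementation:

```python
import numpy as np

# Hypothetical integer labels for two of the five regions named in the text
LUNG, MEDIASTINUM, OUTSIDE = 1, 2, 0

def tnm_category(region_mask, centroid):
    """Assign a lesion to T, N, or M based on the region label at its
    centroid voxel. Simplified sketch of the anatomical-region rule."""
    region = region_mask[tuple(np.round(np.asarray(centroid)).astype(int))]
    if region == LUNG:
        return "T"   # inside the lung mask -> primary tumor
    if region == MEDIASTINUM:
        return "N"   # between the lung masks -> lymph node metastasis
    return "M"       # any other region -> distant metastasis

# Toy labeled volume: slab 0 is lung, slab 1 is mediastinum, rest background
mask = np.zeros((4, 4, 4), dtype=int)
mask[0] = LUNG
mask[1] = MEDIASTINUM
```

In practice the lookup would use the full lesion segmentation rather than a single centroid voxel, but the principle of mapping region membership to a TNM letter is the same.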

Analysis of reasons of FP and FN findings
All FP predictions of the Retina U-Net were visually checked and attributed to categories. FPs were assessed at the lowest classifier threshold possible (≥ 0.1) and at the threshold used for sensitivity analysis, which was defined by the threshold at which the FP/c rate drops below 2. We consider this to be a FP rate per case that is acceptable in a clinical environment. We conservatively rated repeated annotations of TP findings as FP. For example, if a lung tumor was detected three times, this was rated as 1 TP and 2 FP. The category (T, N, and M) of FNs was determined using the original lesion labels as ground truth.

[Table 2 footnote: Preprocessing: clipping and scaling to [−1200, 600] followed by z-score normalization. *The optimal configuration of patch size versus batch size is a diligent task given the hardware constraints of limited GPU memory. Various combinations of the two parameters were tested on the validation set and the best model selected (batch size: 8; patch size: 192 × 192 × 32).]
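The conservative counting rule, under which repeated hits on an already-matched ground-truth lesion count as FPs, can be sketched as follows. The 1D "boxes" and helper names are ours, used only to keep the example compact; the study matched 3D bounding boxes:

```python
def iou_1d(a, b):
    """IoU of two 1D intervals (x0, x1); stand-in for 3D box IoU."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0

def count_tp_fp(pred_boxes, gt_boxes, iou_thr=0.1):
    """Count TPs and FPs, rating repeated detections of an already-matched
    ground-truth lesion as FPs, as described in the text."""
    matched = set()
    tp = fp = 0
    for p in pred_boxes:
        hit = next((i for i, g in enumerate(gt_boxes)
                    if iou_1d(p, g) >= iou_thr), None)
        if hit is None or hit in matched:
            fp += 1      # no overlapping lesion, or duplicate of a match
        else:
            matched.add(hit)
            tp += 1
    return tp, fp

# One true lesion detected twice, plus one spurious detection:
# rated as 1 TP and 2 FP under the rule above.
tp, fp = count_tp_fp([(0, 10), (1, 9), (20, 30)], [(0, 10)])
```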

External validation
In addition to the internal test set, a total of 20 randomly selected data sets from patients with histologically proven primary lung cancer who underwent FDG-PET/CT (Biograph40, Siemens Healthineers; technical details in Supplement A) between 06/2021 and 09/2021 at another university hospital were processed with the algorithm after de-identification. Before processing, 3D ground-truth bounding boxes for all malignant lesions were drawn by R1 and R2 in consensus. Sensitivity and FPs per case were calculated and compared to the results on the internal test set.

Statistical analysis
Statistical analysis was performed using IBM SPSS Statistics for Windows, Version 22.0 (IBM Corp.). Scatter plots and graphs were created with scikit-learn [18]. Continuous data were described by means and standard deviations. Associations between two or more categorical variables were tested with the chi-square test. To test for statistical differences of means of continuous data between two or more groups, ANOVA or nonparametric alternatives were used depending on data structure. Normal distribution was assessed with histograms and Q-Q plots. p values less than 0.05 were defined to indicate statistical significance.

Results

Table 3 provides details on patient characteristics and tumor histology. Age and sex did not differ statistically significantly between the training, validation, and internal test sets (p = 0.32 and p = 0.50, respectively). There were no significant differences between the three internal data sets regarding tumor histology (p = 0.72). Patients from the internal and external test sets were of comparable age (p = 0.61), but more female patients were included in the external test set (p < 0.01) and the tumor histology mix was different (p < 0.01).
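The chi-square comparisons of categorical variables described above can be reproduced outside SPSS; for a 2 × 2 table the statistic has a closed form, and for one degree of freedom the p-value follows from the normal tail. The counts below are made up for illustration and are not the study's data:

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-square for a 2 x 2 contingency table [[a, b], [c, d]].
    With dof = 1, p = erfc(sqrt(chi2 / 2)). A minimal stand-in for the
    SPSS chi-square test named in the text."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))

# Perfectly balanced (hypothetical) table: no association, p = 1
chi2_bal, p_bal = chi2_2x2(10, 10, 10, 10)

# Strongly imbalanced (hypothetical) table: clear association
chi2_imb, p_imb = chi2_2x2(20, 0, 0, 20)
```

Note that SPSS applies continuity corrections and exact tests in some configurations, so results may differ slightly from this uncorrected Pearson statistic.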

Internal performance analysis
In this section, all results are based on the internal test set. The T subcategory analysis revealed that sensitivity for more advanced lesions (T2/T3/T4) was higher than for smaller T1 lesions: while all T3 lesions were detected (100% (95% CI: 73.5-100.0)), T1 tumors were more likely to be missed (detection rate 75.0% (95% CI: 56.6-88.5)). Sensitivity for N lesions was 54.2% (95% CI: 47.3-60.9), and metastases (M) were detected in 72.4% of cases (95% CI: 60.9-82.0). Figure 4 shows the FROC curves for the T detection task. Figure 5 shows an exemplary PET/CT of a patient with bone metastases with algorithm outputs as overlays. Finally, ESM 1 shows a video of detected lesions indicated as overlays on an original data set.

Analysis of false-positive findings

Table 4 provides a comprehensive analysis of the reasons for FPs. Figure 6 shows four examples of FP findings.
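The confidence intervals quoted in this section are consistent with exact (Clopper-Pearson) binomial intervals; in the special case where all lesions are detected, the 95% lower bound is simply (α/2)^(1/n). Assuming this method, the 73.5-100.0% interval for T3 lesions would correspond to 12 T3 lesions in the test set (an inference of ours, not a number stated in the text):

```python
def cp_lower_all_detected(n, alpha=0.05):
    """Clopper-Pearson lower confidence bound for a binomial proportion
    when all n trials are successes: lower = (alpha / 2) ** (1 / n).
    Assumes exact binomial intervals were used (not stated in the paper)."""
    return (alpha / 2.0) ** (1.0 / n)

# With 12 of 12 lesions detected, the lower bound is about 0.735,
# matching the 73.5-100.0% interval quoted for T3 lesions.
lower = cp_lower_all_detected(12)
```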

Analysis of false-negative findings
Of 76 missed findings, 60 were lymph nodes (78.9%), 10 were metastases (13.2%), and 6 were tumors (7.9%; T1: n = 4; T2: n = 1; T3: n = 0; T4: n = 1).

[Figure caption (excerpt): b A corresponding partial confusion matrix for the T lesion detection task. Please note that this is not a standard classification task but an object detection task ("object versus background"), in which true negatives (TNs) are by definition not defined (there are no "background objects"); therefore, a number for TN is lacking.]

Discussion
Our analysis of a Retina U-Net algorithm revealed good overall detection rates for malignant lesions on PET/CT at low false-positive rates per case. In this scenario, the Retina U-Net integrates information from both the CT (anatomical) and PET (metabolic) components. Slightly lower detection performance was noticed for T1 tumors compared to larger T2-T4 lesions. This interesting finding is most probably explained by the fact that small tumors are affected by the partial volume effect, which is especially relevant in PET with its lower spatial resolution compared to high-resolution lung CT. Furthermore, the CT component of FDG-PET/CT is routinely acquired in free breathing, and motion artifacts have a greater influence on the detection of small lesions than of advanced tumors.
Regarding automated lesion detection on PET/CT, Teramoto et al evaluated a method for the detection of pulmonary nodules (average diameter: 19 mm) in 104 patients [12]. The nodules were detected on CT and PET images separately, based on active contour filtering and thresholding, respectively. After using a convolutional neural network (CNN) for FP reduction, the FP/c was 4.9 and the detection rate was 90.1%. In contrast to this multi-step, mainly hard-coded processing pipeline, the Retina U-Net presented in this article is an end-to-end approach, which has the advantage of potential optimization by adding more training data. Schwyzer et al trained a model to perform a binary classification of whether a nodule is present in a single CT image or not [13]. This approach has strong limitations, such as discarding all 3D information by operating only on 2D slices and being incapable of handling multiple lesions per slice. Furthermore, training with whole-slice binary labels only, as opposed to annotations of individual nodules and their associated locations, is known to be data-inefficient and prone to overfitting, since the algorithm requires large amounts of data in order to relate the binary label to the relevant image content [19]. They reached an accuracy of 69.1%, sensitivity of 70.0%, and specificity of 66.7%. Our approach is more granular and also encompasses N and M lesions. Furthermore, it resulted in higher sensitivity on both internal and external test data. An interrater variability analysis showed slightly higher levels of agreement compared to Borrelli et al [20].
Our approach yielded the highest detection rates for T lesions. Detection rates for N and M lesions were lower, probably because the anatomical surroundings provide a stark contrast for T lesions (high attenuation [tumor] vs. low attenuation [lung parenchyma]) but much less so for N and M lesions, which are surrounded by tissues of similar density. All false-positive detections had an anatomical correlate, such as the physiological metabolism of the myocardium. The fact that no arbitrary lesions were detected reinforces our trust in the Retina U-Net algorithm and seems in line with radiological judgment and decision-making. Of note, the FP/c at the chosen threshold dropped from 2.0 to 1.1 when removing those FPs that were caused by double annotations of true lesions, an approach that seems fair given that it is technically easy to implement.
External validation yielded better results than internal testing. This is unusual, but can be explained by the different composition of the external data set regarding tumor histology and lesion type mix: its ratio of T to N/M lesions was higher than in the internal test set. Other factors influencing the performance on the external data set are differences in hardware (PET/CT scanner) and, given the small number of examinations in the external data set, chance. While the internal data set comprised 364 examinations and is among the largest used in the context of AI and PET/CT, the task of lesion sub-categorization drastically decreases the number of training cases per category. Thus, for this initial study, we opted for the simpler task of general lesion detection, i.e., grouping lesions into one foreground class to be distinguished from background, and performed detailed TNM classification with a second approach (anatomical regions). We are aware that this has the disadvantage of introducing an extra layer of uncertainty that decreases the general performance of the whole processing pipeline, although the anatomical regions approach correctly attributed 94.3% of detected lesions.
In the future, the algorithm could serve physicians by reducing false-negative calls, reducing the time needed to analyze staging PET/CTs, and allowing for advanced evaluations such as automated tumor volume quantification.
Our study has limitations. First, the ground-truth data originated from one center, and lesions were manually segmented by two readers in random order without double reading, but with supervision. However, the literature reports high inter-reader agreement for tumor delineation in PET/CT, with ICCs ranging between 0.987 and 0.995 [21], which was confirmed in this study by high IoU values on a subset of 60 lesions. Second, volume cropping to the chest was applied, and extra-thoracic lesions were therefore not completely considered. This has the advantage of avoiding FPs caused by non-tumorous organs with physiologically high PET signal, such as the urinary bladder or the brain. At the same time, the algorithm has not been tested for the detection of metastases (N, M) in the whole body. Third, even though the data set was large, the 95% CIs were wide, because the internal test set encompasses only 20% of the whole internal data set. Fourth, the performance, especially regarding N and M lesions, is currently not good enough for a stand-alone application, and algorithm results warrant validation by a physician trained in nuclear medicine.
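The IoU agreement metric mentioned above is straightforward to compute from two readers' binary masks; this generic sketch is ours and not the study's analysis code:

```python
import numpy as np

def mask_iou(a, b):
    """Intersection-over-union of two binary segmentation masks,
    e.g., the same lesion delineated by two readers."""
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # both masks empty: define agreement as perfect
    return float(np.logical_and(a, b).sum()) / float(union)

# Two toy "reader" masks: reader 2's delineation contains reader 1's
r1 = np.zeros((4, 4), dtype=np.uint8)
r2 = np.zeros((4, 4), dtype=np.uint8)
r1[0:2, 0:2] = 1          # reader 1: 4 voxels
r2[0:2, 0:3] = 1          # reader 2: 6 voxels, superset of reader 1
iou = mask_iou(r1, r2)    # intersection 4, union 6
```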
In conclusion, the Retina U-Net algorithm is well suited for the 3D detection of lung cancer lesions in PET/CT. To further advance the methodology towards clinical application, the approach will be expanded to the whole body. To this end, besides a swift and intuitive workflow allowing for modification of automatically generated results, more extrathoracic annotations are needed.