Study dataset
This is a retrospective analysis of imaging data from the NLST. Trial design and eligibility criteria are described elsewhere [15]. In brief, NLST was a multicentre randomised trial of three rounds of screening with low-dose CT compared to chest radiography for asymptomatic participants aged 55–74 years with a significant smoking history. Participants were followed for lung cancer diagnoses for a median of 6.5 years. Nodule size was measured using electronic callipers by NLST radiologists who had received training in standardised image interpretation. No standard protocol for nodule evaluation was mandated.
Of the 26,722 patients in the CT screening arm of the NLST, 16,684 were excluded as no abnormality was recorded in the NLST database. In our study, CT studies from 10,038 patients with recorded abnormalities were reviewed under the supervision of an experienced radiologist to identify pulmonary nodules. Each nodule was manually annotated and its correspondence to an abnormality found during NLST recorded. All time-points were considered and nodules were tracked over time. Eighty-two patients had no recorded abnormalities that could be matched to a CT finding. Two hundred fifty-two patients with a diagnosis of cancer were excluded because their cancer diagnosis could not be matched to a specific nodule. Nodules that were not solid or part-solid were excluded (n = 1233 patients) because the LCP-CNN was trained on solid and part-solid nodules only. Nodules < 6 mm or > 30 mm using manual measurements were excluded (n = 3007 patients). Nodules < 6 mm were excluded because these do not routinely warrant surveillance according to the Fleischner Society, and masses > 30 mm were excluded because the online Brock calculator and segmentation algorithm were not designed for masses > 30 mm [3, 16, 17]. In total, 4660 participants with 10,485 nodules were included in the analysis. The study flow diagram is provided in Fig. 1.
Automatic nodule size measurements
The U-Net convolutional neural network is a well-established medical segmentation tool that was adapted for nodule segmentation within this study [16, 18]. In total, 1276 participants were randomly selected to train the segmentation algorithm. Of these, participants meeting the inclusion criteria for the analysis in this study (n = 730) were excluded from the validation cohort.
Volumetric segmentation was initiated from a seed point within the nodule identified by doctors under the supervision of a senior chest radiologist (F.V.G.), and then performed by the algorithm in an unsupervised manner. Equivalent spherical diameter was calculated as d = ∛(6V/π), where V is nodule volume. Two different methods were used to measure maximal axial diameter. In the first, the longest distance between any two points on the nodule boundary was calculated on each axial slice, and the maximum among all axial slices was used. However, this method can overestimate the diameter of spiculated nodules. In the second method, the largest diameter was calculated for an ellipse fitted to each axial contour using standard least squares methods. Both methods gave almost identical results. We have reported only the second because it is less sensitive to spiculation and small changes in nodule geometry, in line with Fleischner Society recommendations [19]. Some of these results have been previously published in the form of an abstract and conference proceeding [20, 21].
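The two simpler size measures described above can be illustrated with a minimal sketch. It is not the study's code: the function names are hypothetical, boundary points are assumed to be (x, y) coordinates in millimetres, and the ellipse-fitting method that was actually reported is not shown.

```python
import math
from itertools import combinations


def equivalent_spherical_diameter(volume_mm3):
    """Diameter (mm) of the sphere whose volume equals the nodule volume: d = cbrt(6V/pi)."""
    return (6.0 * volume_mm3 / math.pi) ** (1.0 / 3.0)


def longest_in_slice_distance(boundary_points):
    """Longest distance between any two boundary points of one axial contour."""
    if len(boundary_points) < 2:
        return 0.0
    return max(math.dist(p, q) for p, q in combinations(boundary_points, 2))


def max_axial_diameter(contours_by_slice):
    """First method in the text: maximum over all axial slices of the per-slice longest distance."""
    return max(longest_in_slice_distance(contour) for contour in contours_by_slice)
```

For example, a sphere of diameter 10 mm has a volume of about 523.6 mm³, and `equivalent_spherical_diameter` recovers 10 mm from that volume. The pairwise search is quadratic in the number of boundary points, which is acceptable for contours of small nodules.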
For each participant, the Brock model was used to calculate risk of malignancy using (1) manual diameter provided in the NLST dataset, (2) maximal axial diameter derived from automatic segmentation, and (3) equivalent spherical diameter derived from automatic segmentation. The risk of malignancy was also derived using the LCP-CNN. The LCP-CNN development and validation are fully described in prior publications, and the same version of the model was used in this analysis [13, 14].
Predictive accuracy was primarily evaluated with area under the receiver operating characteristic curve (AUC) analysis. The statistical significance of any difference in accuracy between the methods was computed from the distribution of AUC differences. This was derived by bootstrapping across 10,000 draws from the data with replacement. 95% confidence intervals (CI) were obtained from the distribution of differences. p values were computed using a two-sided permutation test using 10,000 random resamplings of the data [22], with p < 0.05 considered statistically significant.
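The resampling procedure can be sketched as follows. This is an illustrative outline only, written with the Python standard library rather than the NumPy and scikit-learn code used in the study. The function names are hypothetical, the confidence interval is a simple percentile interval, and the permutation test shown is one common paired variant (randomly swapping the two methods' scores per case), which may differ in detail from the procedure in [22].

```python
import random


def auc(labels, scores):
    """AUC as the probability that a malignant case outscores a benign one (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute an AUC")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci_auc_difference(labels, scores_a, scores_b, n_draws=10_000, seed=0):
    """95% percentile interval for AUC(a) - AUC(b) from resampling cases with replacement."""
    auc(labels, scores_a)  # fails early if only one class is present
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    while len(diffs) < n_draws:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if len(set(y)) < 2:
            continue  # redraw if the resample contains a single class
        a = [scores_a[i] for i in idx]
        b = [scores_b[i] for i in idx]
        diffs.append(auc(y, a) - auc(y, b))
    diffs.sort()
    return diffs[int(0.025 * (n_draws - 1))], diffs[int(0.975 * (n_draws - 1))]


def permutation_p_value(labels, scores_a, scores_b, n_resamples=10_000, seed=0):
    """Two-sided p value: how often a random per-case swap gives a difference as large as observed."""
    rng = random.Random(seed)
    observed = abs(auc(labels, scores_a) - auc(labels, scores_b))
    extreme = 0
    for _ in range(n_resamples):
        a, b = [], []
        for sa, sb in zip(scores_a, scores_b):
            if rng.random() < 0.5:
                sa, sb = sb, sa  # swap the two methods' scores for this case
            a.append(sa)
            b.append(sb)
        if abs(auc(labels, a) - auc(labels, b)) >= observed:
            extreme += 1
    return (extreme + 1) / (n_resamples + 1)
```

Swapping scores between methods assumes both are on a comparable scale, which holds here only in the sense that each method outputs a probability of malignancy. The pairwise AUC computation is quadratic in the number of cases, so a rank-based implementation would be preferable for a dataset of the size analysed in this study.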
Information ablation
Covariates were removed from the full Brock model, and the predictive performance of three ‘feature-reduced’ Brock models was tested:
(1) Age, sex, emphysema, family history of cancer, nodule location, and nodule count were included.
(2) Nodule size, nodule type (solid or part-solid), and spiculation were included.
(3) All covariates in the full Brock model, except spiculation, were included.
Unlike the Brock model, the LCP-CNN does not consist of human-interpretable terms. Hence, feature removal was performed by ablating information from the CT images. As the LCP-CNN was not trained to analyse ablated CT images, an experimental AI model was trained to predict malignancy from ablated CT images using the same dataset and the same eight folds as were used to train the LCP-CNN model [13, 14]. For a given fold, three-quarters of the data were partitioned for training the AI model, one-eighth for validation, and one-eighth for testing. Each participant was assigned to the test partition in precisely one of the folds. Each fold had an approximately equal proportion of participants with malignant nodules, and each fold was associated with a single, independently trained model. During analysis, the results of the eight folds were combined to provide a set of cross-validation results for the entire dataset, as described in prior publications [13, 14].
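One way to realise such a partitioning scheme is sketched below. The details are assumptions rather than the published procedure: participants are dealt into eight groups separately for malignant and benign cases, and for fold k the validation partition is taken to be group (k + 1) mod 8. All names are hypothetical.

```python
import random

N_FOLDS = 8


def assign_groups(malignant_by_participant, seed=0):
    """Assign each participant to one of N_FOLDS groups, stratified by malignancy status."""
    rng = random.Random(seed)
    groups = {}
    for status in (True, False):
        ids = sorted(p for p, m in malignant_by_participant.items() if m == status)
        rng.shuffle(ids)
        for i, pid in enumerate(ids):
            groups[pid] = i % N_FOLDS  # deal round-robin to balance the classes
    return groups


def partitions_for_fold(groups, fold):
    """Six groups for training, one for validation, one for testing."""
    val_group = (fold + 1) % N_FOLDS  # assumption: the neighbouring group is used for validation
    parts = {"train": [], "validation": [], "test": []}
    for pid, group in groups.items():
        if group == fold:
            parts["test"].append(pid)
        elif group == val_group:
            parts["validation"].append(pid)
        else:
            parts["train"].append(pid)
    return parts
```

Because each participant belongs to exactly one group, each appears in the test partition of exactly one fold, so concatenating the eight sets of test predictions yields one prediction per participant for the whole dataset.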
The predictive performance of the AI model was tested on unmodified CT images and on three types of ablated CT images:
(1) All information about the nodule was ablated (‘parenchyma only’). A region 15 mm away from the furthermost edge of the nodule margin towards the hilum was evaluated, giving an image that contained background lung parenchyma but in which the nodule was not visible.
(2) All information about the background lung and the nodule internal texture was ablated. Background lung was replaced with average lung density across all patients (− 825 Hounsfield units), and nodule internal texture was replaced with mean nodule density.
(3) A sphere of the same volume as the nodule and with mean nodule density was implanted in the ‘parenchyma only’ model as described above.
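The density-replacement and sphere-implantation steps can be illustrated with a small sketch. It is a simplified assumption-laden outline, not the study's implementation: the image is represented as plain Python lists of Hounsfield unit values, voxels are assumed to be isotropic, the sphere radius is given in voxels, and all function names are hypothetical.

```python
import math

MEAN_LUNG_HU = -825.0  # average lung density stated in the text


def ablate_background_and_texture(hu_values, nodule_mask):
    """Set background voxels to MEAN_LUNG_HU and nodule voxels to the mean nodule density."""
    nodule_values = [v for v, inside in zip(hu_values, nodule_mask) if inside]
    if not nodule_values:
        raise ValueError("the nodule mask is empty")
    mean_nodule = sum(nodule_values) / len(nodule_values)
    return [mean_nodule if inside else MEAN_LUNG_HU for inside in nodule_mask]


def equivalent_sphere_radius(volume):
    """Radius of the sphere with the given volume: r = cbrt(3V / (4 pi))."""
    return (3.0 * volume / (4.0 * math.pi)) ** (1.0 / 3.0)


def implant_sphere(patch, centre, radius, density):
    """Return a copy of a 3D patch (indexed [z][y][x]) with a uniform sphere written into it."""
    cz, cy, cx = centre
    r2 = radius ** 2
    return [
        [
            [
                density if (z - cz) ** 2 + (y - cy) ** 2 + (x - cx) ** 2 <= r2 else value
                for x, value in enumerate(row)
            ]
            for y, row in enumerate(plane)
        ]
        for z, plane in enumerate(patch)
    ]
```

In this sketch the implanted sphere keeps only nodule volume and mean density, so any shape, margin, or texture information is absent from the resulting image by construction.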
As information ablation was carried out using mean nodule density, a subgroup analysis was performed to compare the predictive performance of the experimental AI model on part-solid and solid nodules.
Data analysis
Data analysis was performed using Python 3.8 installed on Ubuntu 20.04 with NumPy 1.19.4, scikit-learn 0.21.3, and pandas 0.23.4 libraries.