Background

Lung cancer is the leading cause of cancer mortality worldwide, claiming approximately 2 million lives in 2020 [1]. Despite the poor prognosis of lung cancer, early detection of malignant lung nodules (LNs) has been shown to substantially improve patient survival, as these lesions are mostly amenable to curative interventions. Incidental LNs are reported in 30–50% of all chest computed tomography (CT) scans [2,3,4], making these scans a crucial opportunity for LN detection.

Owing to the shortage of radiologists and their increasing workload [5], actionable nodules may be overlooked during radiological interpretation. Previous studies reported an LN-detection sensitivity of 65–85% for individual readers [6,7,8], depending on the indication, nodule size, reader expertise, and the time allowed for assessment. Double reading improves the detection rate to 70–95% [7, 9, 10]. However, double reading is not customary in routine clinical practice, as it is tedious and time-consuming.

In response to this challenge, software companies have increasingly developed Computer-Aided Detection (CAD) systems to serve as a second reader, several of which are already commercially available. Vendors have reported promising diagnostic performance based on internal studies. While internal validation is a mandatory step in software development, external validation adds significant value by evaluating a CAD system's performance on new, independent datasets. Regrettably, only 6% of all developed algorithms are externally validated [11, 12].

Besides the lack of external validation, numerous vendors prioritize detection of LNs over automated capture of all clinically relevant characteristics, such as calcification, spiculation, and location. Moreover, texture is one of the key characteristics in determining the malignancy risk of a nodule [13,14,15,16]. In fact, it is well established that the majority of persistent subsolid nodules (encompassing both ground-glass and part-solid nodules) represent lung adenocarcinomas at various stages [17].

To the best of our knowledge, only a few commercially available tools are capable of nodule characterization, and only a small subset of these can differentiate between all three texture types. The characterization sensitivity of these commercial algorithms has been neither externally validated nor made publicly available.

In this study, the performance of a commercially available artificial intelligence (AI)-based CAD system, which claims automatic detection and accurate classification of solid, part-solid, and ground-glass nodules on CT scans, is externally validated.

Methods

The institutional review board waived the requirement for informed patient consent for this retrospective study.

Study population

According to our sample size calculation (see Supplementary Material, Sample Size), our study required approximately 110 nodule-containing scans, alongside a similar number of non-nodule-containing scans.

Chest CT scans acquired between January 2020 and December 2021 at Erasmus Medical Center, a Dutch university hospital, were retrospectively evaluated. We performed a systematic query in our Picture Archiving and Communication System (PACS) on chest CT scans of subjects aged ≥ 18 years, using the following keywords and their inflections: “solid nodules”, “ground-glass nodule”, “mixed nodule”, and “part-solid nodule”. Among these, scans were selected to fulfil the stratification criteria (see Supplementary Material, Table S1), ensuring that all texture types were adequately represented. Scans with metal artifacts, excessive motion artifacts, disordered slices, > 10 nodules, a slice thickness > 5 mm, or an absent radiology report were excluded. Scans were screened and included until the stratification criteria were satisfied.

Based on the radiology reports, 263 consecutive scans were included: 113 nodule-containing scans and 150 non-nodule-containing scans. Note that this distribution could change after ground truthing (GT).

Reference standard

A pulmonary nodule was defined as a lesion measuring 3–30 mm in diameter, in accordance with the Fleischner glossary [15, 18, 19].

The reference standard was based on the expertise of three radiologists (each with at least eight years of experience): the initial radiology reporter, an annotator (R1), and an arbitrator (R2).

As mentioned earlier, CT examinations were included based on the initial radiology reports, from which nodule information was extracted. Thereafter, all included CT examinations were read by a radiologist (R1) using the advanced annotation platform RedBrick.AI (Claymont, Delaware, USA). Blinded to the original report, R1 annotated nodules per slice and specified texture, location, and the presence of calcification or spiculation. The average diameter and volume of each nodule were calculated.
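
The manuscript does not spell out the diameter formula; a common convention (recommended by the Fleischner Society and assumed here purely for illustration) is to average the long- and short-axis diameters measured in the same plane:

\[ d_{\mathrm{avg}} = \frac{d_{\mathrm{long}} + d_{\mathrm{short}}}{2} \]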

All annotations made by R1 were compared with the original radiology report. When a nodule was identified by R1 but not mentioned in the initial radiology report, or vice versa, it was considered a discrepant interpretation. These discrepancies were reviewed by a second radiologist (R2), who had access to both the report and R1’s annotations and was tasked with arbitrating between them.

All board-certified readers were blinded to CAD findings.

Artificial intelligence based CAD system

All scans were processed by qCT v1.1 (Qure.ai, Mumbai, India), a commercially available AI-based medical device [20]. This algorithm detects pulmonary nodules and provides information on their location, texture, calcification/spiculation status, as well as the average diameter/volume.

During internal validation, qCT demonstrated an 82% nodule-level detection sensitivity. Characterization sensitivities were 82% for texture, 91% for calcification, 82% for spiculation, and 96% for location.

More information about the qCT software can be found in the Supplementary Materials- Specifications CAD.

Statistical analysis

Descriptive analysis

After ground truthing, scans were categorized into nodule-containing and non-nodule-containing groups. We compared demographic (age, sex), clinical (presence or absence of other chest abnormalities such as atelectasis, fibrosis, and pericardial fluid), and acquisition parameters (machine type, scan type) between these groups. Age was dichotomized at 55 years, as prior research indicates an increased LN prevalence above this age [21].

Variables were summarized as counts and percentages for nodule-containing and non-nodule-containing scans. Chi-square tests were used to assess differences between these groups. Throughout the study, a p-value ≤ 0.05 was considered statistically significant.
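
As a minimal illustration of this group comparison (a sketch, not the study code; the counts and variable names below are hypothetical), a contingency table can be tested in base R as follows:

# Hypothetical 2 x 2 table: sex distribution in nodule vs non-nodule scans
tab <- matrix(c(70, 75, 51, 67), nrow = 2, byrow = TRUE,
              dimnames = list(group = c("nodule", "non-nodule"),
                              sex   = c("male", "female")))
chisq.test(tab)  # Pearson's Chi-square test; p <= 0.05 considered significant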

Detection analysis

Analyses were conducted at both the scan and the nodule level, with a focus on the latter. CAD findings were scored using the 3D intersection-over-union (IoU) method.
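
For two segmented volumes A (a CAD finding) and B (a GT nodule), the 3D IoU is the ratio of their overlapping volume to their combined volume:

\[ \mathrm{IoU}(A, B) = \frac{\lvert A \cap B \rvert}{\lvert A \cup B \rvert} \]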

At the scan level, a scan was classified as true positive (TP) if CAD detected at least one nodule with a volume overlap of at least 10% with a GT nodule [22]. If the overlap was less than 10%, or if CAD failed to detect any nodule in a nodule-containing scan, the scan was considered a false negative (FN). A scan was labelled false positive (FP) if CAD detected any finding in a non-nodule-containing scan, and true negative (TN) if it detected nothing in a non-nodule-containing scan [23, 24]. The modified Wilson score method was used to construct the 95% confidence intervals (CIs) of sensitivity, specificity, and precision. The scan-level false positive per image (FPPI) rate was calculated by dividing the number of FP scans by the number of nodule-containing scans. Empirical area under the receiver operating characteristic curve (AUC-ROC) and area under the precision-recall curve (AUC-PR) analyses, along with the F1-score, were used to assess CAD’s overall performance. DeLong’s method was used to construct the 95% CI for the AUC-ROC, while the Clopper-Pearson method was used for the F1-score’s 95% CI. Note that recall is synonymous with sensitivity.
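
With TP, FP, TN, and FN counted per scan as defined above, these metrics reduce to the standard formulas:

\[ \mathrm{Sensitivity} = \frac{TP}{TP + FN}, \quad \mathrm{Specificity} = \frac{TN}{TN + FP}, \quad \mathrm{Precision} = \frac{TP}{TP + FP} \]

\[ F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \qquad \mathrm{FPPI} = \frac{\text{number of FP scans}}{\text{number of nodule-containing scans}} \]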

At the nodule level, each CAD finding was assessed individually using the 10% IoU criterion. A CAD finding was considered TP if at least 10% of its volume overlapped with a GT nodule; a GT nodule was counted as FN if CAD did not detect it or if the overlap was < 10%; and a CAD finding was counted as FP if it did not correspond to any GT nodule, including findings within the non-nodule-containing scan group. Nodule-level sensitivity was calculated, and its 95% CI was constructed using the method described by Rao and Scott for correlated data [25]. Overall nodule-level performance was further evaluated using the free-response receiver operating characteristic (FROC) curve.
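
A minimal sketch of this scoring rule in R (hypothetical values, not the study code; the best 3D IoU of each CAD finding with any GT nodule is assumed to have been computed beforehand):

# Hypothetical example: best 3D IoU of each CAD finding with any GT nodule
# (0 when a finding overlaps no GT nodule at all).
iou_per_finding <- c(0.45, 0.08, 0.00, 0.62, 0.31)
tp <- sum(iou_per_finding >= 0.10)  # findings matching a GT nodule
fp <- sum(iou_per_finding <  0.10)  # findings without sufficient overlap
# GT nodules not matched by any finding at >= 10% overlap count as FN
# (this sketch assumes each TP corresponds to a distinct GT nodule).
n_gt <- 6                           # hypothetical number of GT nodules
fn <- n_gt - tp
sensitivity <- tp / n_gt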

Nodule-level sensitivity was also reported across subgroups, as was the sensitivity of R1.

Characterization analysis

The characteristics assigned by the CAD system were summarized as counts and percentages for detected and missed nodules. To evaluate the accuracy of the CAD system in characterizing detected nodules, sensitivity was calculated along with its 95% Wilson score CI.
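
For reference, the (unmodified) Wilson score interval for a proportion p̂ estimated from n detected nodules, with z = 1.96 for 95% coverage, is:

\[ \mathrm{CI}_{95\%} = \frac{\hat{p} + \frac{z^{2}}{2n} \pm z\sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{z^{2}}{4n^{2}}}}{1 + \frac{z^{2}}{n}} \]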

The absolute errors between GT and predicted diameters and volumes were calculated and summarized as means. Bland-Altman plots were used to visualize the agreement between GT and predicted measurements.
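
With x_i the GT measurement, x̂_i the CAD prediction, and d_i = x̂_i − x_i the paired differences, the summary statistics reported below are:

\[ \mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \lvert \hat{x}_{i} - x_{i} \rvert, \qquad \mathrm{bias} = \bar{d}, \qquad \mathrm{LoA} = \bar{d} \pm 1.96\, s_{d} \]

where s_d is the standard deviation of the differences and LoA denotes the 95% limits of agreement.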

Statistical analyses were conducted using R v4.1.2 (R Core Team, 2021) in RStudio v2022.12.0+353 (RStudio Team, 2021). An overview of all packages used can be found in the Statistical Analysis Packages section of the Supplementary Material.

Results

Based on the radiology reports, a total of 263 CT scans were included. After the first reading, R1’s annotations agreed with the original report in 230 scans; R2 adjudicated the remaining 33 discrepant scans. The overall workflow is shown in Fig. 1.

Fig. 1

Flowchart of the workflow: All reports of CT scans acquired between January 2020 and December 2021 were screened based on the inclusion and exclusion criteria and the case stratification. Scans were included until the case stratification shown in Table S1 of the Supplementary Material was satisfied. A total of 263 scans were included, with 150 scans reporting no nodules and 113 scans mentioning nodules. The included scans were then processed by the AI-based CAD system and read by the first reader. The first reader was blinded to the report and was tasked with annotating nodules and assigning their characteristics, including texture, presence of calcification/spiculation, and location. These annotations were compared with the radiology reports. In 230 scans the annotations matched the radiology reports, while 33 cases exhibited discrepancies. A second radiologist reviewed all 33 cases using the radiology report and the first reader’s annotations. Eighteen of the discrepant scans turned out to be nodule-containing, while 15 were non-nodule-containing. Neither reader had access to the CAD output during this process

Of the 263 scans, 121 contained at least one nodule, while the remaining 142 did not contain any nodules. Table 1 shows the demographic, clinical, and CT acquisition parameters of the nodule-containing and non-nodule-containing scans. As anticipated, the percentage of nodule-containing scans was lower among patients aged < 55 years than among patients aged ≥ 55 years (23.5% vs 60.2%). Furthermore, no significant differences were observed in sex, machine type, or scan type between the nodule-containing and non-nodule-containing groups (p = 0.5, p = 0.3, and p = 0.2, respectively).

Table 1 Clinical, demographic, and CT acquisition variables in nodule-containing (n = 121) and non-nodule-containing scans (n = 142)

Scan-level

Of the 121 nodule-containing scans, 104 were correctly flagged as containing nodules; of the 142 non-nodule-containing scans, 121 were correctly flagged as nodule-free. CAD demonstrated a scan-level sensitivity of 104/121 (86.0%; CI: 78.6%–91.0%), specificity of 121/142 (85.2%; CI: 78.4%–90.1%), and precision of 104/125 (83.2%; CI: 75.7%–88.7%). The AUC-ROC (Fig. 2A) was 0.865 (CI: 0.837–0.892), the AUC-PR (Fig. 2B) was 0.844, and the F1-score was 0.846 (CI: 0.794–0.888). The overall scan-level FPPI was 21/121 (0.174).
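
As a consistency check (using the rounded precision of 0.832 and recall of 0.860 reported above), the F1-score and FPPI follow directly:

\[ F_{1} = \frac{2 \times 0.832 \times 0.860}{0.832 + 0.860} \approx 0.846, \qquad \mathrm{FPPI} = \frac{21}{121} \approx 0.174 \]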

Fig. 2

(A) Receiver operating characteristic (ROC) curve and (B) precision-recall (PR) curve: on the left (A), the ROC curve of the CAD system is depicted; the AUC-ROC equals 0.865 (CI: 0.837–0.892). On the right (B), the PR curve of the same system is shown; the AUC-PR equals 0.844. Note that sensitivity is also known as recall

Nodule-level (detection)

One hundred and eighty-three nodules were identified on the 121 nodule-containing scans according to the reference standard, yielding an average of 1.51 nodules per nodule-containing scan. In these scans, the CAD system detected 149 nodules with a mean IoU of 0.417 (CI: 0.396–0.438) and missed 34 nodules. Forty-nine FP findings occurred within the nodule-containing scans, whereas 23 FP findings were observed within the non-nodule-containing scans. The CAD system had a nodule-level detection sensitivity of 149/183 (81.4%; CI: 76.2%–86.6%), with an average of 49/121 (0.405) FPs per nodule-containing examination. The FROC plot in Fig. 3 shows the sensitivity of the CAD system at various FP rates. Sensitivity by subgroup is reported in Table 2. Nodule-level sensitivity did not differ considerably among the subgroups. However, within the machine-type subgroup, an outlier was observed in the “Others” group, with a sensitivity of 50.0%; this discrepancy might be attributed to the small sample size (n = 4).
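
These figures follow directly from the counts above:

\[ \frac{183}{121} \approx 1.51 \ \text{nodules per scan}, \qquad \mathrm{Sensitivity} = \frac{149}{183} \approx 81.4\%, \qquad \frac{49}{121} \approx 0.405 \ \text{FPs per nodule-containing scan} \]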

Fig. 3

Free-response receiver operating characteristic (FROC) plot: average false positive rate on the x-axis and the corresponding sensitivity on the y-axis

Table 2 Subgroup nodule-level detection sensitivity of the total 183 GT nodules

The detection sensitivity per texture type was 82/94 (87.2%), 42/47 (89.4%) and 25/42 (59.5%) for solid, part-solid and ground-glass, respectively (Table 3).

Table 3 GT-nodule characteristics among CAD detected (n = 149) and missed nodules (n = 34)

Nodule-level (characterization)

To evaluate the accuracy of the CAD system in characterizing each nodule, only true (GT-confirmed) nodules were considered. Table 3 displays the distribution of GT-nodule characteristics among CAD-detected and missed nodules. A total of 149 nodules were detected and 34 nodules were missed. Remarkably, half of the missed nodules (17) were ground-glass nodules. The distribution of calcification and spiculation was comparable between detected and missed nodules. Furthermore, a substantial portion of the missed nodules was located in the left lower lobe (38.4%).

Characterization sensitivity of CAD was computed for the correctly detected nodules (Table 4): texture 116/149 (77.2%; CI: 69.8%–83.2%), calcification 137/149 (91.9%; CI: 86.5%–95.3%), spiculation 123/149 (82.6%; CI: 75.7%–87.8%), and location 141/149 (94.6%; CI: 89.8%–97.3%).

Table 4 Characterization accuracy of the detected nodules by the CAD-system (n = 149)

The overall mean absolute error in diameter was 0.57 mm (CI: 0.38 mm–0.76 mm). When stratified by diameter categories of < 6 mm, 6–8 mm, and > 8 mm, the errors were 0.37 mm (CI: 0.23 mm–0.52 mm), 0.56 mm (CI: 0.23 mm–0.89 mm), and 0.96 mm (CI: 0.39 mm–1.53 mm), respectively.

For volume, the overall mean absolute error was 147.14 mm³ (CI: 97.97 mm³–206.31 mm³). Stratified by volume categories of < 100 mm³, 100–250 mm³, and > 250 mm³, the errors were 16.97 mm³ (CI: 9.15 mm³–24.79 mm³), 32.60 mm³ (CI: 21.92 mm³–43.28 mm³), and 249.67 mm³ (CI: 145.87 mm³–353.47 mm³), respectively.

The bias and limits of agreement for diameter and volume are reported in Fig. 4 and Fig. 5, respectively.

Fig. 4

Bland-Altman plot for diameter: The Bland-Altman plot illustrates the agreement between the ground truth and predicted diameter. The mean of the differences (bias) and the limits of agreement (LoA), along with their 95%CI, are depicted. The mean difference between the two values is 0.47 mm (CI: 0.27 mm–0.68 mm). The bias is represented by a blue area, with a dashed line indicating the point estimate of the bias and a dotted line representing the corresponding 95%CI. The upper LoA is 2.90 mm (CI: 2.56 mm–3.25 mm). The upper LoA is illustrated by a green area, with a dashed line indicating the point estimate and a dotted line representing the 95%CI. The lower LoA is -1.95 mm (CI: -2.30 mm to -1.61 mm). The lower LoA is depicted by a red area, with a dashed line indicating the point estimate and a dotted line representing the 95%CI

Fig. 5

Bland-Altman plot for volume: The Bland-Altman plot illustrates the agreement between the ground truth and predicted volume. The mean of the differences (bias) and the limits of agreement (LoA), along with their 95% confidence intervals (CI), are depicted. The mean difference between the two values is 61.8 mm³ (CI: -1.68 mm³ to 125 mm³). The bias is represented by a blue area, with a dashed line indicating the point estimate of the bias and a dotted line representing the corresponding 95%CI. The upper LoA is 830 mm³ (CI: 722 mm³–939 mm³). The upper LoA is illustrated by a green area, with a dashed line indicating the point estimate and a dotted line representing the 95%CI. The lower LoA is -707 mm³ (CI: -815 mm³ to -598 mm³). The lower LoA is depicted by a red area, with a dashed line indicating the point estimate and a dotted line representing the 95%CI

Analysis of conflicting scans

In the 33 conflicting scans (18 nodule-containing and 15 non-nodule-containing), R1’s interpretations did not align with the radiology reports. Following adjudication, 27 true nodules were identified within these scans. The CAD system detected 19 of these true nodules and marked 10 false findings. R1 identified 9 of the true nodules but also marked 39 FPs; note, however, that these 39 findings relate to the total dataset of 263 scans. Table 5 summarizes the characteristics of the 27 true nodules within the discrepant scans and provides an overview of the nodules found and/or missed by R1 and CAD. Across the full dataset, R1 identified a total of 165 true nodules, yielding a sensitivity of 165/183 (90.2%).

Table 5 GT characteristics of all lung nodules within the disagreement scans (n = 27): lung nodules identified only by R1 (n = 2), lung nodules detected only by CAD (n = 12), lung nodules found by both R1 and CAD (n = 7), and lung nodules found by neither R1 nor CAD (n = 6)

Discussion

This external validation study evaluated the standalone performance of a commercial AI-based CAD system designed to automatically detect and characterize solid, part-solid, and ground-glass LNs in CT scans.

Based on the reference standard, a total of 183 true nodules were identified. CAD successfully detected 149 nodules, achieving a detection sensitivity of 81.4%, slightly lower than R1’s detection performance (165 nodules; sensitivity 90.2%). Within the discrepant scans, two of the 27 nodules were found by R1 but missed by the CAD system, while 12 nodules were detected only by the CAD system. Although CAD did not detect more nodules than R1, the system has the potential to increase R1’s sensitivity by at least 7% (12/165) when used as a second reader. We believe that the primary benefit of deploying CAD lies in detecting LNs in the lower lobes as well as small nodules.

In the assessment of CAD’s performance, 72 findings were labelled FP and 34 findings FN. Nodule detection is an intricate process, largely owing to the complexity of the nodules themselves, which results in a notable incidence of FPs and FNs. FPs may arise from intrapulmonary lymph nodes and granulomas, while FNs can result from nodules situated adjacent to a pleural fissure or from subtle nodules (e.g., small nodules or ground-glass nodules) that evade detection. Moreover, bronchovascular structures and technical artefacts further compound the risk of both FPs and FNs. Figure 6 illustrates several examples of FP and FN findings from our study.

Fig. 6

Examples of FP and FN findings: On the left (A), three examples of false positives (FP) are depicted. According to our reviewing radiologists, the top one represents an obvious blood vessel in the middle lobe; the middle CT scan contains slight motion artefacts, likely resulting in the incorrect flagging of a blood vessel in the left lower lobe; the bottom one is a perifissural lymph node. On the right (B), three false negatives (FN) missed by the CAD system are shown. The top image shows a part-solid nodule in the right lower lobe; the middle image displays a ground-glass nodule, also located in the right lower lobe; the bottom image features a subtle, calcified solid nodule situated in the right lower lobe

The overall detection sensitivity achieved by the evaluated CAD system is comparable to, or slightly higher than, the nodule-level sensitivity observed for other commercially available CAD algorithms. Lo et al. (2018), for example, evaluated a commercial CAD system using scans from the National Lung Screening Trial and reported a nodule-level sensitivity of 82% at 0.75 FPPI [26]. The performance of another commercial CAD system was assessed by Murchison et al. (2022) using data from the United Kingdom; that study reported a sensitivity of 82.3% at 1 FPPI, reaching a maximum sensitivity of 95.9% at an average FP rate of 10.9 [27].

Besides detection performance, the characterization sensitivity of the system was also assessed in our study. Traditionally, characterization relies on the subjective judgment of the evaluating radiologist, a process susceptible to inter-reader variation [28,29,30,31,32]. This evaluation is crucial in determining the subsequent steps for patient care [13, 15, 18, 19]. Characterization using CAD can help achieve a more objective approach to nodule assessment. In our study, overall mean absolute errors of 0.57 mm for diameter and 147.1 mm³ for volume were measured between GT and CAD-predicted nodules, which is considered acceptable and comparable with inter-reader variability [28,29,30,31,32]. The overall sensitivity for the other characterizations exceeded 80% (91.9% for calcification, 82.6% for spiculation, and 94.6% for location). However, texture sensitivity fell slightly below this (77.2%); for ground-glass and part-solid nodules in particular, performance was 72.0% and 38.1%, respectively, indicating notable room for improvement. The technical challenge posed by these texture types is twofold: first, they must be detected, which is a significant difficulty in itself; our study revealed that 17/34 (50%) of missed nodules were ground-glass nodules (Table 3). Second, once detected, they must be correctly characterized. Given the technical challenge of detecting and distinguishing these textures, numerous AI-based CAD systems lack this essential characterization functionality.

To date, this is the first study to validate a commercially available CAD system capable of automatically detecting and characterizing all three LN texture types. Moreover, almost all previous studies primarily emphasize detection performance rather than characterization, so no direct comparison with other commercial tools is feasible. One study did validate the characterization performance of another commercial CAD system, but that system can only differentiate between solid and subsolid nodules; it reported a classification sensitivity of 98.8% for solid nodules and 68.4% for subsolid nodules [27].

One limitation of our study is that re-evaluation by the adjudicator was limited to the 33 conflicting scans. One may therefore question whether all 72 FP findings of CAD are correctly classified, since some may theoretically have been overlooked by both R1 and the initial radiology report. Another drawback is that we did not provide precise data on the operation and computation time of the CAD system, owing to the absence of real-time clinical scenarios and the retrospective nature of our study. Furthermore, the exclusion of CT examinations containing artefacts and/or > 10 LNs makes the selected scans not fully representative of real-world clinical practice. Nevertheless, it is important to acknowledge that artefacts in scans also pose challenges for radiologists and that, in practice, radiologists typically assess a maximum of 5–10 nodules per scan [33]. Despite these limitations, the main aim of our study was to determine the standalone performance of CAD on external data rather than its impact in a real-world setting.

While the CAD tool shows nodule detection comparable to a single read, its drawbacks, including a high FP rate and the inability to make rational decisions based on characterization and clinical context, underline why experienced radiologists prefer to review scans independently. An increased FP rate and inadequate classifications demand additional effort from radiologists to re-evaluate scans. Moreover, a significant portion of detected lesions proves to be benign upon further examination, emphasizing the essential role of radiologists in decision-making for follow-up procedures [34, 35].

Future research could compare radiologists’ performance in detecting and characterizing LNs with and without CAD support. In this way, we can assess whether radiologists’ LN detection rates improve and whether nodule characterization becomes more objective when aided by an algorithm. Such an investigation could also elucidate the broader implications of integrating CAD into clinical practice, encompassing its effects on workflow optimization, diagnostic precision, and treatment decisions [36]. Finally, exploration of cost-effectiveness, employing appropriate study designs, is imperative to demonstrate the economic viability of CAD within healthcare systems.

Conclusion

In conclusion, the evaluated commercially available CAD system exhibited detection performance slightly below that of a single reader but comparable to, or marginally better than, other commercially available LN detection applications. CAD detects different LNs than a single reader, indicating its potential to enhance a radiologist’s performance when employed as a second reader. While CAD shows promising characterization performance in identifying nodule size, location, spiculation status, and calcification status, further refinement in texture classification is required.