Data
This study used the LIDC/IDRI data set [10], consisting of 1,018 helical thoracic CT scans collected retrospectively from seven academic centres. Nine cases with inconsistent slice spacing or missing slices were excluded. In addition, 121 CT scans with a section thickness of 3 mm or more were excluded, since thick-section data are not optimal for CAD analysis. This resulted in 888 CT cases available for evaluation. The characteristics of the input data are shown in Tables 1, 2, and 3.
Table 1 Manufacturer and scanner model distribution of the 888 CT scans in our data set
Table 2 Section thickness distribution of the 888 CT scans in our data set
Table 3 Distribution of the reconstruction kernels used for the 888 CT scans in our data set
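For illustration, a minimal sketch of the two exclusion criteria in Python, assuming hypothetical metadata fields (slice_positions, section_thickness_mm) that would in practice be derived from the DICOM headers:

```python
# Minimal sketch of the exclusion criteria described above. The metadata
# field names and the spacing tolerance are assumptions, not the study's
# actual implementation.

def is_evaluable(scan):
    """Return True if a scan passes both exclusion criteria."""
    # Criterion 1: consistent slice spacing; a missing slice would show
    # up as a jump in the spacing between consecutive slice positions.
    positions = scan["slice_positions"]
    spacings = [b - a for a, b in zip(positions, positions[1:])]
    if max(spacings) - min(spacings) > 1e-3:  # tolerance is an assumption
        return False
    # Criterion 2: section thickness below 3 mm (thick-section data excluded).
    return scan["section_thickness_mm"] < 3.0
```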
LIDC/IDRI image annotation
The LIDC/IDRI employed a two-phase image annotation process [10]. In the first phase (the blinded phase), four radiologists independently reviewed all cases. In the second phase (the unblinded phase), the annotations of the other three radiologists were made available and each radiologist independently reviewed their own marks along with the anonymized marks of their colleagues. Findings were annotated and categorized as nodule≥3 mm, nodule<3 mm, or non-nodule. Non-nodule marks were used to indicate abnormalities in the scan that were not considered nodules. With this two-phase process, the LIDC investigators aimed to identify all lung nodules as completely as possible, without forcing consensus among the readers. More details about the annotation process can be found in [10]. An XML file with the annotations is publicly available for every case.
Nodule selection and purpose
In this study, we included all annotations available in the XML files for the 888 scans. The focus of this study was on the nodule≥3 mm group. As a result of the LIDC/IDRI image annotation process, each nodule≥3 mm had been annotated by one, two, three, or four radiologists. In total, the data set of this study included 777 locations that were marked as nodule≥3 mm by all four radiologists. These 777 nodule≥3 mm annotations can be categorized by size as follows: 22 nodules <4 mm, 228 nodules 4–6 mm, 199 nodules 6–8 mm, and 328 nodules >8 mm. The number of nodules per scan ranged from 1 to 8.
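The selection and size binning could be sketched as follows; the annotation record layout is hypothetical, and the handling of exact boundary values (e.g., whether a 6 mm nodule falls in the 4–6 mm or the 6–8 mm bin) is an assumption, since it is not specified here.

```python
# Hedged sketch of the nodule selection step. Each annotation cluster is
# assumed to carry the set of readers who marked it as nodule>=3 mm.

def select_reference_nodules(clusters):
    """Keep only locations marked as nodule>=3 mm by all four readers."""
    return [c for c in clusters if len(c["readers_nodule_ge3mm"]) == 4]

def size_category(diameter_mm):
    """Bin a nodule diameter into the size categories used above;
    the treatment of boundary values is an assumption."""
    if diameter_mm < 4:
        return "<4 mm"
    if diameter_mm <= 6:
        return "4-6 mm"
    if diameter_mm <= 8:
        return "6-8 mm"
    return ">8 mm"
```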
The purpose of this study was twofold. First, we aimed to assess the performance of three state-of-the-art nodule CAD systems. Second, we performed an observer experiment to investigate whether CAD can find additional lesions missed during the extensive LIDC annotation process.
CAD systems
Three CAD systems were used: a commercial CAD system, Visia (MeVis Medical Solutions AG, Bremen, Germany); a commercial prototype CAD system, Herakles (MeVis Medical Solutions AG, Bremen, Germany); and an academic nodule CAD system, ISICAD (Utrecht Medical Center, Utrecht, the Netherlands) [11]. ISICAD was the leading academic CAD system in the ANODE09 nodule detection challenge [9]. For all three CAD systems, a list of candidate marks per CT scan was obtained. Each CAD candidate is described by a 3D location. Herakles and ISICAD additionally provide a CAD score per candidate. The CAD score is the output of the internal classification scheme of the CAD system and is a measure of the likelihood that a candidate is a pulmonary nodule. An internal threshold on the CAD score determines which candidates become active CAD marks and, hence, are shown to the user, and which candidates are suppressed. Since different thresholds can be applied to the CAD score, a CAD system can have multiple operating points. A low threshold generates more CAD marks, typically increasing sensitivity at the cost of more false positive detections. A high threshold generates fewer false positives but may reduce the sensitivity of a CAD system. For all three CAD systems, one fixed operating point is set internally, which we will refer to as the system operating point.
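Conceptually, an operating point is simply a threshold applied to the CAD score; a minimal sketch, where the candidate structure (a dict with a "cad_score" key) is a hypothetical stand-in:

```python
# Minimal sketch: an operating point as a threshold on the CAD score.

def active_marks(candidates, threshold):
    """Return the candidates shown to the user at this operating point.
    Lowering the threshold yields more marks (typically higher sensitivity
    but more false positives); raising it has the opposite effect."""
    return [c for c in candidates if c["cad_score"] >= threshold]
```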
Evaluation
The performance of the CAD systems was analyzed on the set of 777 nodules annotated by 4/4 radiologists as nodule≥3 mm. We employed free-response operating characteristic (FROC) analysis [12], in which detection sensitivity is plotted against the average number of false positive detections per scan. Confidence intervals were estimated using bootstrapping with 5,000 iterations [13]. CAD marks on locations annotated as nodule≥3 mm by three or fewer radiologists, as nodule<3 mm, or as non-nodule were counted as false positives. For Visia, no CAD scores were available for the CAD candidates; consequently, only a single operating point, and not a full FROC curve, could be generated for Visia.
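A hedged sketch of how one FROC operating point and a bootstrap confidence interval could be computed; the per-scan record layout is an assumption, and this is illustrative rather than the software actually used in the study:

```python
import random

def froc_point(scans):
    """One FROC point: (sensitivity, mean false positives per scan).
    Each scan is assumed to carry a hit/miss flag per reference nodule
    and a count of false-positive CAD marks at the operating point."""
    hits = sum(sum(s["nodule_hits"]) for s in scans)
    nodules = sum(len(s["nodule_hits"]) for s in scans)
    fps = sum(s["false_positives"] for s in scans)
    return hits / max(nodules, 1), fps / len(scans)

def bootstrap_ci(scans, n_iter=5000, alpha=0.05, seed=0):
    """Case-level bootstrap (5,000 iterations, as in the study) giving a
    (1 - alpha) confidence interval for the sensitivity."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_iter):
        sample = [rng.choice(scans) for _ in scans]  # resample whole scans
        stats.append(froc_point(sample)[0])
    stats.sort()
    lo = stats[int(alpha / 2 * n_iter)]
    hi = stats[int((1 - alpha / 2) * n_iter) - 1]
    return lo, hi
```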
To gain more insight into which types of nodules were missed by CAD, we examined the characteristics of the false negatives, as scored by the LIDC readers for all nodule≥3 mm findings. We defined subsolid nodules as nodules for which the majority of the radiologists gave a texture score smaller than 5 (1=ground-glass/non-solid, 3=part-solid, 5=solid). Subtle nodules were defined as nodules for which the majority of the radiologists gave a subtlety score of 3 or lower (1=extremely subtle, 5=obvious).
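These two definitions map directly onto majority votes over the per-reader scores; a minimal sketch, assuming the input is the list of scores given for a single nodule:

```python
# Direct encoding of the majority-based definitions above.

def is_subsolid(texture_scores):
    """Subsolid: majority of readers gave a texture score below 5."""
    return sum(s < 5 for s in texture_scores) > len(texture_scores) / 2

def is_subtle(subtlety_scores):
    """Subtle: majority of readers gave a subtlety score of 3 or lower."""
    return sum(s <= 3 for s in subtlety_scores) > len(subtlety_scores) / 2
```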
To assess the robustness of the CAD algorithms, we also evaluated the CAD results on different subsets of the data. The LIDC/IDRI data set is a heterogeneous collection of CT scans, and CAD algorithms could conceivably perform differently on different types of data. We analyzed the following factors: (1) presence of contrast material, i.e., non-contrast versus contrast-enhanced scans; (2) section thickness, i.e., cases with section thickness <2 mm versus ≥2 mm; and (3) reconstruction kernel, i.e., soft/standard versus enhancing/overenhancing kernels.
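As a sketch of this stratification, assuming hypothetical per-scan metadata fields derived from the DICOM headers:

```python
# Sketch of the subgroup assignment for the three factors above; all
# field names are hypothetical stand-ins, not the study's actual code.

def subgroup_labels(scan):
    """Assign a scan to the three subgroup factors analyzed above."""
    return {
        "contrast": "contrast-enhanced" if scan["contrast_enhanced"]
                    else "non-contrast",
        "thickness": "<2 mm" if scan["section_thickness_mm"] < 2
                     else ">=2 mm",
        "kernel": "soft/standard" if scan["kernel_soft"]
                  else "enhancing/overenhancing",
    }
```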
Observer study
In order to evaluate whether CAD can find lesions missed during the extensive annotation process of the LIDC/IDRI database, we considered the CAD marks of the best-performing CAD algorithm that were counted as false positives at its system operating point. Two conditions were differentiated. In the first condition, the location of the CAD mark had in fact been marked during the LIDC annotation process, but not by all four readers as nodule≥3 mm, as required for it to be counted as a true positive; these CAD marks can be subdivided according to the LIDC readings. The second condition comprised CAD marks with no corresponding LIDC marks at all. The latter CAD marks were independently inspected by four chest radiologists, since these are potentially nodules overlooked by all four LIDC readers. Thus, we mimicked the original LIDC annotation process as though CAD had been included as an additional independent reader in the first phase of the image annotation process. CAD marks were categorized as nodule≥3 mm, nodule<3 mm, non-nodule, or false positive. Electronic measurement tools were available to measure size. To reduce the workload for the radiologists, a research scientist (5 years of experience in nodule CAD research) first removed the marks that were obviously not a nodule. CAD marks rated as nodule≥3 mm by all four radiologists in our study were independently evaluated by an experienced radiologist, who scored subtlety, location, type, and attachment to other structures. Subtlety was scored on a five-point scale (1=extremely subtle, 5=obvious).
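The split into the two conditions amounts to matching false-positive CAD marks against the existing LIDC annotations; a hedged sketch, where the matching criterion (e.g., a distance threshold between a mark and an annotated location) is a hypothetical helper rather than the study's actual criterion:

```python
# Illustrative split of false-positive CAD marks into the two conditions.

def split_fp_marks(fp_marks, lidc_annotations, is_match):
    """is_match(mark, annotation) -> bool is a hypothetical matching
    criterion; the two returned lists correspond to the two conditions."""
    cond_marked, cond_unmarked = [], []
    for mark in fp_marks:
        if any(is_match(mark, ann) for ann in lidc_annotations):
            cond_marked.append(mark)    # subdivided by the LIDC readings
        else:
            cond_unmarked.append(mark)  # reviewed by four chest radiologists
    return cond_marked, cond_unmarked
```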